Abstract

We consider the problem of finding optimal energy sharing policies that maximize
the network performance of a system comprising multiple sensor nodes and a single
energy harvesting (EH) source. Sensor nodes periodically sense the random field
and generate data, which is stored in the corresponding data queues. The EH source
harnesses energy from ambient energy sources and the generated energy is stored in
an energy buffer. Sensor nodes receive energy for data transmission from the EH
source. The EH source has to efficiently share the stored energy among the nodes
in order to minimize the long-run average delay in data transmission. We formulate
the problem of energy sharing between the nodes in the framework of average cost
infinite-horizon Markov decision processes (MDPs). We develop efficient energy
sharing algorithms, namely a Q-learning algorithm with exploration mechanisms based
on the ε-greedy method as well as the upper confidence bound (UCB). We extend these
algorithms by incorporating state and action space aggregation to tackle state-action
space explosion in the MDP. We also develop a cross entropy based method that
incorporates policy parameterization in order to find near-optimal energy sharing
policies. Through simulations, we show that our algorithms yield energy sharing
policies that outperform the heuristic greedy method.

Keywords: 

Energy harvesting sensor nodes, energy sharing, Markov decision process, Q-learning, state 
aggregation. 

1 Introduction 

A sensor network is a group of independent sensor nodes, each of which senses the
environment. Sensor networks find applications in weather and soil condition monitoring,
object tracking and structure monitoring. Each sensor node in the network senses the
environment and transmits the sensed data to a fusion node. The fusion node obtains data
from several sensor nodes and carries out further processing.

In order to sense the environment and transmit data to the fusion node, nodes require
energy, and most often the nodes are equipped with pre-charged batteries for this purpose.
However, as the nodes exhaust their battery power and stop sensing, the network performance
degrades. The lifetime of the network is linked to the lifetimes of the individual nodes.
Hence, the network becomes inoperable when a large number of nodes stop sensing. Thus,
in a network with battery-operated sensor nodes, the primary intention is to enhance the
lifetime of the network, which may often lead to a compromise in the network performance.
Many techniques have been proposed which focus on improving the lifetime of networks of
sensor nodes. One of the more recent techniques which deals with this problem is the use
of energy harvesting to provide a perpetual source of energy for the nodes.

An energy harvesting (EH) sensor node replenishes the energy it consumes by harvesting
energy from the environment (e.g., solar, wind power etc.) or other sources (e.g., body
movements, finger strokes etc.) and converting it into electrical energy. This way an EH
node can be constantly powered through energy replenishment. Hence, in contrast to networks
of battery-operated nodes, long-term network performance metrics become the appropriate
objective. Thus, the goal pertaining to an EH sensor network is to reduce the average
delay in data transmission. Even though an EH sensor node potentially has an infinite
amount of energy, the harvested energy is available only intermittently, as it is usually
location and time dependent. Moreover, the amount of energy replenished might be lower than
the required amount. Therefore it is important to match the energy consumption with the
amount of energy harvested in order to prevent energy starvation. This underlines the need
for intelligently managing harvested energy to achieve the goal of good network performance.

A drawback associated with an EH sensor (node) is that it requires additional circuitry
to harvest energy, which increases the cost of the node. A network which contains several
such nodes is not economically viable. The cost of the network can be reduced if there
exists a central EH source which harvests energy and shares the available energy among
multiple sensor nodes in its vicinity. Such an architecture is incorporated in motes. A mote
(Fig. 1) is a single unit on which sensors with different functionalities are arranged (see H).
For instance, there could be pressure sensors, temperature sensors etc., in the same unit to
make different sets of measurements simultaneously. Alternatively, the sensors could be of
the same functionality but deployed together at different angles in order to have a 360° view
of the entire sensing region.

Each of these sensors (within a unit) has its own data buffer, and a common EH
source feeds energy to each of the data queues. Usually, the EH source is a battery which
is recharged by energy harvesting. The sensors in the mote are perpetually powered, but
only if the energy harvested in the source is efficiently shared. Thus there is a need for a
technique that dynamically allocates energy to each of the data buffers of the individual
sensors so that the average queue lengths (or transmission delays) across the data buffers
are minimized.

Figure 1: Mote with pressure, humidity and temperature sensors (Courtesy: Advanticsys
Pvt Ltd. and UC Berkeley)

In this paper, we focus on the problem of developing algorithms that achieve efficient
energy allocation in a system comprising multiple sensor nodes with their own data buffers
and a common EH source. Another scenario (which, however, we do not consider here) where
our techniques are applicable is the case of downlink transmissions [22], where a base station
(BS) maintains a separate data queue for each individual sensor node. The BS in question
would also typically be powered by a single EH source, and again the problem would be to
dynamically allocate the available energy to each one of the data queues. As suggested by
a reviewer of the journal version of this paper, the above is equivalent to a communication
setup with an energy harvesting transmitter and n receivers which are connected to
the transmitter over orthogonal, equal-gain links. The transmitter employs n finite
data buffers to store incoming data intended for the n receivers and must optimally allocate
its energy to transmit data intended for the n receivers.

We present learning algorithms for a controller which has to judiciously distribute the
energy amongst the competing nodes. The controller decides on the amount of energy to be
allocated to every node at every decision instant, considering the amount of data waiting to
be transmitted in each of the data queues. Thus the state of the system comprises the
amount of data in each of the data queues along with the energy available in the source.
Given the system state at an instant, the controller has to find the best possible way
to allocate energy to the individual nodes. The decided allocation has a bearing on the
total amount of data transmitted at that instant as well as the amount of data that will be
transmitted in the future. Our algorithms help the controller learn the optimal allocation for
every state, one which reduces the buildup of data in the data buffers. In the algorithm we
present, the controller systematically tries out several possible allocations feasible in a state
before learning the optimal allocation. This method is computationally efficient for a small
number of states. However, it becomes computationally expensive when there are numerous
states. We propose approximation algorithms to find a near-optimal allocation of energy
in this scenario. In the following subsection, we survey the literature on EH nodes and energy
management policies employed in EH sensor networks.

1.1 Related Work 

Optimizing energy usage in battery-powered sensors is addressed in [331135 • The problem 
of designing appropriate schedules for sensor data transmission is discussed in [33].
A schedule of data transmission indicates when the battery-powered sensor transmits data.
Transmitting data uses up energy, while not transmitting data results in error in estimation
of parameters dependent on the sensor data. The authors in [33] consider battery-powered
sensor nodes, each of which needs to minimize the energy utilized for data transmission.
The estimation of parameters dependent on the sensor data may however involve error if
the sensor does not transmit data for long periods of time. The objective in [33] is to find
optimal periodic sensor schedules which minimize the estimation error at the fusion node
and optimize energy usage.

In [33], the authors consider battery-powered sensors with two transmission power levels.
The transmission power levels have different packet drop rates, with the higher transmission
power level having a lower packet drop rate. The sensor can choose one of the power levels
for data transmission. It is assumed that the fusion node sends an acknowledgment (ACK
or NACK) to the sensor node which indicates whether the data packet has been received
or not. The objective in [33] is to minimize the average expected error in state estimation
under an energy constraint. At time k, based on the communication feedback, the sensor knows
whether the previous packets have been received by the fusion node or not. The problem
of choosing the transmission power level is modeled as an MDP and the optimal schedule is
shown to be stationary. The works [331 [ 33 ] consider the problem of efficient energy usage 
in battery powered sensors. The aspect of network performance is not considered in these. 
Our work deals with optimizing energy sharing in EH nodes where maximizing a network 
performance objective is the primary goal. 

An early work in rechargeable sensors is [H]. The authors of [18] present a framework for
the sensor network to adaptively learn the spatio-temporal characteristics of energy
availability and provide algorithms to use this information for task sharing among nodes.
In ini. the irregular and spatio-temporal characteristics of harvested energy are considered.
The authors discuss the conditions for ensuring energy-neutral operation, i.e., using the
energy harvested at an appropriate rate such that the system continues to operate forever.
Practical methods for a harvesting system to achieve energy-neutral operation are developed.
Compared to [mill], we focus on minimizing the delay in data transmission from the nodes
while also ensuring energy-neutral operation.

The scenario of a single EH transmitter with limited battery capacity is considered in
[m [2B]. In [2n], the transmitter communicates in a fading channel, whereas in [TT], no
specific constraints on the channel are considered. The problem of finding the optimal
transmission policy to maximize the short-term throughput of an EH transmitter is considered
in HU. Under the assumption of an increasing concave power-rate relationship, the
short-term throughput maximizing transmission policy is identified. In [26], the transmitter
gets channel state information and the node has to adaptively control the transmission rate.
The objective is to maximize the throughput by a deadline and minimize the transmission
completion time of a communication session. The authors in [26] develop an online algorithm
which determines the transmit power at every instant by taking into account the amount of
energy available and the channel state.

The efficient usage of energy in a single EH node has been dealt with in some recent 
works [23 ESI EHl SS]. A channel and data queue aware sleep/active/listen mechanism in 
this direction is proposed in [25]. Listen mode turns off the transmitter, while sleep mode 
is activated if channel quality is bad. The node periodically enters the active mode. In 
the listen mode, the queue can build up resulting in packets being dropped. In the sleep 
mode, incoming packets are blocked. A bargaining game approach is used to balance the 
probabilities of packet drop and packets being blocked. The Nash equilibrium solution of 
the game controls the sleep/active mode duration and the amount of energy used. 

The model proposed in [361 EQ] considers a single EH sensor node with finite energy 
and data buffers. The authors assume that data sensed is independent across time instants 
and so is the energy harvested. The amount of data that can be transmitted using some 
specified energy is modeled using a conversion function. In [SB] , a linear conversion function 
is used and optimal energy management policies are provided for the same. These policies 
are throughput optimal and mean delay optimal in a low SNR regime. However, in the 
case of a non-linear conversion function, [36] provides certain heuristic policies. In [30], a
non-linear conversion function is used. The authors therein provide simulation-based learning
algorithms for the energy management problem. These algorithms are model-free, i.e., do not
require an explicit model of the system and the conversion function. Unlike these works,
our work deals with multiple sensors sharing a common EH power source. The objective
is to minimize the delay in data transmission from the nodes. However, channel constraints
are not addressed in our work.

Data packet scheduling problems in EH sensor networks are considered in |1U] and [13 •
It is assumed in [16] that a single EH node has separate data and energy queues, while
the data sensed and energy harvested are random. The same assumption is made for each
sensor in a two-sensor communication system considered in ra. For simplicity it is
assumed that all data bits have arrived in the queue and are ready for transmission, while the
energy harvesting times and harvested energy amounts are known before the transmission
begins. In [IS]([IS]) the objective is to minimize the time by which all data packets from the
node(s) are transmitted (to the fusion node). It is proposed to optimize this by controlling
the transmission rate. The authors develop an algorithm to find the transmission rate at
every instant, which optimizes the time to transmit the data packets. A two-user Gaussian
interference channel with two EH sensor nodes and receivers is considered in [32]. This
paper focuses on short-term sum throughput maximization of data transmitted from the two
nodes before a given deadline. The authors provide generalized water-filling algorithms for
the same. In contrast to the models developed in jlHl SSI 02], our model assumes multiple
sensors sharing a common energy source. The data and energy arrivals are uncertain and
unknown. Moreover, the problem we deal with has an infinite horizon, wherein the objective
is to reduce the mean delay of data transmission from the nodes. We develop
simulation-based learning algorithms for this problem.

Cooperative wireless network settings are considered in [TOl 051 03]. Three different
network settings with energy transfer between nodes are considered in [15]. Energy management
policies which maximize the system throughput within a given duration are determined in all
the three cases. A water-filling algorithm is developed which controls the flow of harvested
energy over time and among the nodes. In [13], there exists an EH relay node and multiple
other EH source nodes. The source nodes have infinite data buffer capacity. The relay node
transfers data between the source and destination nodes. The source and relay nodes can
transfer energy to one another. A sum rate maximization problem in this setting is solved.
In [To], multiple pairs of sources and destinations communicate via an EH relay node. The
EH relay node has a limited battery, which is recharged by wireless energy transfer from the
source nodes. The EH relay node has to efficiently distribute the power obtained among the
multiple users. The authors investigate four different power allocation strategies for outage
performance (outage is an event in which data is lost due to lack of battery energy or
transmission failures caused by channel fades). We do not consider energy cooperation between
nodes in the sensor network. Moreover, we do not assume wireless energy transfer in our
model.

A multi-user additive white Gaussian noise (AWGN) broadcast channel comprising a
single EH transmitter and M receivers is considered in [28]. The EH transmitter harvests
energy from the environment and stores it in a queue. The transmitter has M data queues,
each of which stores data packets intended for a specific receiver. The data queues have a
fixed number of bits to be delivered to the receiver. The objective in [28] is to find a
transmission policy that minimizes the time by which all the bits are transmitted to the
receivers. An optimization problem is formulated and structural properties of the optimal
policy are derived. In our work, we model energy sharing in multiple nodes when there is a
single power source. We assume uncertain data and energy arrival processes. The objective
is to minimize the average delay in data transmission from the nodes, when there is data
arrival at every instant.




1.2 Our Contributions 

• We consider the problem of efficient energy allocation in a system with multiple sensor
nodes, each with its own data buffer, and a common EH source.

• We model the above problem as an infinite-horizon average cost Markov decision
process (MDP) [1],[32] with an appropriate single-stage cost function. Our objective in
the MDP setting is to minimize the long-run average delay in data transmission.

• We develop reinforcement learning algorithms which provide optimal energy sharing
policies for the above problem. The learning procedure does not need system
knowledge such as data and energy rates or the cost structure, and learns using data
obtained in an online manner.

• In order to deal with the dimensionality of the state space of the MDP, we present
approximation algorithms. These algorithms find near-optimal energy distribution
profiles when the state-action space of the MDP becomes unmanageable.

• We demonstrate through simulations that the policies obtained from our algorithms
are better than the policies obtained from a heuristic greedy method and a combined
nodes Q-learning algorithm (see Section 6).

1.3 Organization of the Paper 

The rest of the paper is organized as follows. The next section describes the model, related
notation and assumptions. Section 3 formulates the energy sharing problem as an MDP.
Section 4 presents the RL algorithms used for solving the MDP. Section 5 highlights the need
for approximate policies and gives a detailed explanation of the approximation algorithms we
develop for the problem. Section 6 presents the simulation results of our algorithms. Section
7 provides the concluding remarks and possible future directions. Finally, an appendix at
the end of the paper contains the proofs of two results.


2 Model and Notation 

We consider the problem of sharing the energy available in an energy harvesting source
among multiple sensor nodes. We present a slotted, discrete-time model (Fig. 2) for this
problem. A sensor node in the network senses a random field and stores the sensed data
in a data buffer of finite size. In order to transmit the sensed data to a fusion (or
central) node, the sensor node needs energy, which it obtains from an energy harvesting
source. The energy harvesting source has an energy buffer of finite capacity. The
common EH source is an abstract entity in the model. It is generally a rechargeable battery
which is replenished by random energy harvests. We assume fragmentation of data packets
(fluid model) as in [36] and hence these will be treated as bit strings.
Figure 2: The System Model


Let $q_k^i$ denote the data buffer level of node $i$ and $E_k$ be the energy buffer level at the
beginning of slot $k$. Sensor node $i$ generates $X_k^i$ bits of data by sensing the random field.
The source harvests $Y_k$ units of energy. Based on the data queue levels $\{q_k^1, \ldots, q_k^n\}$ and the
energy level $E_k$, the energy sharing controller decides upon the number of energy bits to be
provided to every node. Let $T_k^i$ units of energy be provided to node $i$ in slot $k$. Using it,
the node transmits $g(T_k^i)$ bits of data. We have assumed the function $g$ to be monotonically
non-decreasing and concave as with other references ([Ml sa Eg [271E]). Note that the
Shannon channel capacity for Gaussian channels gives such a conversion function and in
particular,

$$g(T_k) = \frac{1}{2}\log(1 + \beta T_k),$$

where $\beta$ is a constant and $\beta T_k$ gives the signal-to-noise ratio (SNR). This is a non-decreasing
concave function. We have assumed this form in the simulation experiments. However, our
algorithms work regardless of the form of the conversion function and will learn the optimal
energy sharing policy for any form of conversion function (see Remark 15).

It should be noted that we do not consider wireless energy transfer from the source node 
to the sensor nodes. Here we consider the source node to be a rechargeable battery which 
powers the nodes. The queue lengths in the data buffers evolve with time as follows: 


$$q_{k+1}^i = \left(q_k^i - g(T_k^i)\right)^+ + X_k^i, \quad 1 \le i \le n,\ k \ge 0, \tag{1}$$

where $(q_k^i - g(T_k^i))^+ = \max(q_k^i - g(T_k^i), 0)$, and the energy buffer queue length evolves as given
below:

$$E_{k+1} = \left(E_k - \sum_{i=1}^{n} T_k^i\right) + Y_k, \quad k \ge 0, \tag{2}$$

where $\sum_{i=1}^{n} T_k^i \le E_k$.
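As a minimal sketch of the dynamics in (1) and (2), the following Python snippet advances the data queues and the energy buffer by one slot. The names `g` and `step`, the default `beta = 1.0` and the sample numbers are illustrative assumptions, not values from the paper. The conversion function uses the logarithmic form given earlier (with the factor 1/2 assumed), and the finite buffer sizes are not enforced because they do not appear in (1) and (2).

```python
import math


def g(energy_units, beta=1.0):
    """Assumed conversion function: bits sent with `energy_units` of energy."""
    return 0.5 * math.log(1.0 + beta * energy_units)


def step(queues, energy, allocation, data_arrivals, energy_arrival, beta=1.0):
    """One slot of (1)-(2): returns (next_queues, next_energy)."""
    # The split must be nonnegative and cannot exceed the stored energy.
    if any(t < 0 for t in allocation) or sum(allocation) > energy:
        raise ValueError("infeasible energy split")
    # Equation (1): leftover data (floored at zero) plus new arrivals.
    next_queues = [
        max(q - g(t, beta), 0.0) + x
        for q, t, x in zip(queues, allocation, data_arrivals)
    ]
    # Equation (2): unused energy plus the new harvest.
    next_energy = (energy - sum(allocation)) + energy_arrival
    return next_queues, next_energy


# Hypothetical example: two nodes, 5 units of stored energy, split as (3, 2).
example_queues, example_energy = step([3.0, 1.0], 5, [3, 2], [1, 0], 2)
```

In this example all stored energy is spent, so the next energy level equals the harvest of 2 units.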

Assumption 1. The generated data rates at time $k+1$, $X_{k+1} = (X_{k+1}^1, \ldots, X_{k+1}^n)$,
where $n$ denotes the number of sensors in a node, evolve as a jointly Markov process, i.e.,

$$X_{k+1} = f^1(X_k, W_k), \quad k \ge 0, \tag{3}$$

where $f^1$ is some arbitrary vector-valued function with $n$ components and $\{W_k, k \ge 1\}$ is a
noise sequence with probability distribution $P(W_k \mid X_k)$ depending on $X_k$. Thus, the
generated data $\{X_k, k \ge 0\}$ is both spatially and temporally correlated. Moreover, the sequence
$\{X_k^i, k \ge 0\}$ satisfies $\sup_{k \ge 0} E[X_k^i] \le r < \infty$. Further, the energy arrival process evolves as:

$$Y_{k+1} = f^2(Y_k, V_k), \quad k \ge 0, \tag{4}$$

where $f^2$ is some scalar-valued function and $\{V_k, k \ge 1\}$ is the noise sequence with probability
distribution $P(V_k \mid Y_k)$ depending on $Y_k$.

Remark 1. Assumption 1 is general enough to cover most of the stochastic models for
the data and energy arrivals. A special case of Assumption 1 is to consider that for any
$k \ge 0$ and $1 \le i \le n$, $X_k^i$ is independent of $X_{k-1}^i, X_{k-2}^i, \ldots, X_0^i$ and the given sequence
$\{X_k^i\}_{k \ge 0}$ for a given $i \in \{1, \ldots, n\}$ is identically distributed. Similarly, for any $k \ge 0$, $Y_k$
is independent of $Y_{k-1}, Y_{k-2}, \ldots, Y_1, Y_0$ and the sequence $\{Y_k\}$ is identically distributed. In
Section 6 we show results of experiments for the above i.i.d. setting as well as for the more
general setting described earlier.
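To illustrate the kind of process Assumption 1 allows, here is a small Python sketch of a jointly Markov data arrival model. Everything in it is a made-up example: the noise is a common shock shared by all nodes (which gives spatial correlation), its distribution depends on the current arrival vector (as the assumption permits), and the names `sample_noise`, `f1`, `simulate_arrivals` and the bound `max_bits` are hypothetical.

```python
import random


def sample_noise(x_prev, rng):
    """Draw W_k from a distribution that depends on X_k (mean-reverting shock)."""
    p_up = 0.7 if sum(x_prev) <= len(x_prev) else 0.3
    return 1 if rng.random() < p_up else -1


def f1(x_prev, w, max_bits=3):
    """Illustrative update rule: shift every node's arrival by w, clipped to [0, max_bits]."""
    return tuple(min(max(x + w, 0), max_bits) for x in x_prev)


def simulate_arrivals(x0, steps, seed=0):
    """Generate X_0, X_1, ..., X_steps under the toy model above."""
    rng = random.Random(seed)
    path = [tuple(x0)]
    for _ in range(steps):
        w = sample_noise(path[-1], rng)
        path.append(f1(path[-1], w))
    return path


arrival_path = simulate_arrivals((1, 2), steps=50)
```

The i.i.d. special case of Remark 1 would instead draw each arrival independently of the past, for example with `rng.randint(0, max_bits)`.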

3 Energy Sharing Problem as an MDP 

A Markov decision process (MDP) is a tuple of states, actions, transition probabilities and
single-stage costs. Given that the MDP is in a certain state, and an action is chosen by the
controller, the MDP moves to a 'next' state according to the prescribed transition
probabilities. The objective of the controller is to select a sequence of actions as a function
of the states in order to minimize a given long-term objective (cost). We formulate the
energy sharing problem in the MDP setting using the long-run average cost criterion. The
MDP formulation requires that we identify the states, actions and the cost structure for the
problem, which are described next.

The state $s_k$ is a tuple comprising the data buffer levels of all sensor nodes, the level
of the energy buffer in the source, and the data and energy arrivals in the past. Note that for
$1 \le i \le n$, the data buffer level $q_k^i$ is a nonnegative integer bounded by the data buffer size.
Similarly, $E_k$ is a nonnegative integer bounded by the energy buffer capacity. Thus in stage $k$, in
the context of Assumption 1, the state is $s_k = (q_k^1, q_k^2, \ldots, q_k^n, E_k, X_{k-1}, Y_{k-1})$. However, when we
assume that for $1 \le i \le n$, $\{X_k^i\}$ and $\{Y_k\}$ are i.i.d. (as in Remark 1), then the state tuple
simplifies to $s_k = (q_k^1, q_k^2, \ldots, q_k^n, E_k)$.

The set of all states is the state-space, which is denoted by $S$. Similarly, $A$ denotes the
action-space, which is the set of all actions. The set of feasible actions in a state $s_k$ is denoted
by $A(s_k)$. A deterministic policy $\pi = \{T_k, k \ge 0\}$ is a sequence of maps such that at time $k$
when the state is $s_k = (q_k^1, \ldots, q_k^n, E_k, X_{k-1}, Y_{k-1})$, i.e., when there are $q_k^j$ units of data at node $j$,
$1 \le j \le n$, and $E_k$ bits of energy in the source, $X_{k-1}$ is the data arrival vector and $Y_{k-1}$ is the
energy harvested at time $k-1$, then $T_k(s_k) = (T_k^1(s_k), T_k^2(s_k), \ldots, T_k^n(s_k))$ gives the number
of energy bits to be given to each node at time $k$ (i.e., it gives the energy split). Thus the
action to be taken in state $s_k$ is given by $T_k(s_k) \in A(s_k)$. A deterministic policy which does
not change with time is referred to as a stationary deterministic policy (SDP). We denote
such a policy $\pi$ as $\pi = (T, T, \ldots)$, where $T(s_k)$ is the action chosen in state $s_k$. We set the
single-stage cost $c(s_k, T(s_k))$ as the sum of the number of bits in the data buffers. Thus,

$$c(s_k, T(s_k)) = \sum_{i=1}^{n} q_k^i. \tag{5}$$
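The feasible action set $A(s_k)$ and the cost in (5) can be sketched in Python as follows. This is only an illustration under the assumption that energy is allocated in integer units and that any split whose total does not exceed $E_k$ is feasible; the names `feasible_actions` and `single_stage_cost` are hypothetical.

```python
from math import comb


def feasible_actions(energy, n_nodes):
    """Yield every split (T^1, ..., T^n) of nonnegative integers with sum <= energy."""
    if n_nodes == 0:
        yield ()
        return
    for first in range(energy + 1):
        # Whatever the first node receives is no longer available to the rest.
        for rest in feasible_actions(energy - first, n_nodes - 1):
            yield (first,) + rest


def single_stage_cost(queues):
    """Cost (5): total number of bits waiting in the data buffers."""
    return sum(queues)


# Hypothetical state: queue levels (2, 1) and 2 units of stored energy.
actions = list(feasible_actions(2, 2))
cost = single_stage_cost((2, 1))
# The number of splits is C(E + n, n), which grows quickly with E and n.
assert len(actions) == comb(2 + 2, 2)
```

The rapid growth of this count with the energy level and the number of nodes is one way to see why the state-action space becomes hard to handle.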

Remark 2. In order to formulate the energy sharing problem in the framework of MDP, we
require the state sequence $\{s_k = (q_k^1, q_k^2, \ldots, q_k^n, E_k)\}_{k \ge 0}$ under a given policy to be a Markov
chain, i.e.,

$$P(s_{k+1} \mid s_k, s_{k-1}, \ldots, s_0, \pi) = P(s_{k+1} \mid s_k, \pi).$$

We have generalized the assumption on $\{X_k^i, 1 \le i \le n\}_{k \ge 0}$ and $\{Y_k\}_{k \ge 0}$ and consider jointly
Markov data arrival and Markovian energy arrival processes. Remark 1 applies to the i.i.d.
case. If we assume the data arrivals $\{X_k^i\}_{k \ge 0}$ for a fixed $i \in \{1, 2, \ldots, n\}$ and the energy
arrivals $\{Y_k\}_{k \ge 0}$ are i.i.d., then the Markov assumption can be seen to be easily satisfied.

The Markov property for the state evolution $\{s_k\}_{k \ge 0}$ is necessary as we can only search
for policies based on the present state of the system. Otherwise, the policies will be based
on the entire history. The search for optimal policies in the space of history-based policies is
a computationally infeasible task.

In the general case where $\{X_k\}$ is jointly Markov, note that the state sequence $\{s_k\}_{k \ge 0}$
under a given policy will not be a Markov chain. Now consider the augmented state
$\hat{s}_k = (s_k, X_{k-1}, Y_{k-1})$. Under a given policy $\pi = (T, \ldots, T)$, the state evolution can be described as

$$\hat{s}_{k+1} = \begin{pmatrix} (q_k^1 - g(T^1(s_k)))^+ + X_k^1 \\ \vdots \\ (q_k^n - g(T^n(s_k)))^+ + X_k^n \\ \left(E_k - \sum_{i=1}^{n} T^i(s_k)\right) + Y_k \\ f^1(X_{k-1}, W_{k-1}) \\ f^2(Y_{k-1}, V_{k-1}) \end{pmatrix}. \tag{6}$$

This can be written as $\hat{s}_{k+1} = h(\hat{s}_k, W_{k-1}, V_{k-1})$ for a suitable vector-valued function $h$.
This is the standard description for the state evolution for an MDP (see Chapter 1 in W).
Since the probability distribution of the noise $W_{k-1}$ ($V_{k-1}$) depends only on $X_{k-1}$ ($Y_{k-1}$), the
augmented state sequence $\{\hat{s}_k = (s_k, X_{k-1}, Y_{k-1})\}_{k \ge 0}$ forms a Markov chain. This facilitates
the search for policies based only on the present augmented state.


Remark 3. The sensor node may generate data as packets, but in the model we allow for
arbitrary fragmentation of data during transmission. Hence packet boundaries are no longer
relevant and we consider bit strings. This is the fluid model as described in HJ. The data
is considered to be stored in the data buffers as bit strings and hence the data buffer levels
are discrete. The fluid model assumption (data discretization) has been made in [i^j.
For energy harvesting we consider energy discretization. Energy discretization implies that
we have assumed that discrete levels of energy are harvested and stored in the queue. Energy
discretization has been considered in some previous works. Owing to these assumptions
on data generation and energy harvesting, the state space is discrete and finite.


The long-run average cost of an SDP $\pi$ is given by

$$\lambda^{\pi} = \lim_{m \to \infty} E\left[\frac{1}{m} \sum_{k=0}^{m-1} c(s_k, T(s_k))\right]. \tag{7}$$


In contrast, a stationary randomized policy (SRP) is a sequence of maps $\psi = \{\phi, \phi, \ldots\}$
such that for a state $s_k$, $\phi(s_k, \cdot)$ is a probability distribution over the set of feasible actions
in state $s_k$. Such a policy does not change with time. The single-stage cost $d(s_k)$ of an SRP
$\psi$ is given by

$$d(s_k) = \sum_{a \in A(s_k)} \phi(s_k, a)\, c(s_k, a), \tag{8}$$

where $a$ gives the energy split in state $s_k$. The long-run average cost of an SRP $\psi$ is

$$\lambda^{\psi} = \lim_{m \to \infty} E\left[\frac{1}{m} \sum_{k=0}^{m-1} d(s_k)\right]. \tag{9}$$

We observe that the term $q_k^i$ in (5) does not include the effect of the action explicitly. Hence
we modify the cost function to include the effect of the action taken explicitly.
In order to enable reformulation of the average cost objective in the modified form,
we prove the following lemma. Define

$$\hat{\lambda}^{\pi} = \lim_{m \to \infty} E\left[\frac{1}{m} \sum_{k=0}^{m-1} \sum_{i=1}^{n} \left(q_k^i - g(T^i(s_k))\right)^+\right]. \tag{10}$$

Lemma 1. Let $q_k^i$, $T^i(s_k)$, $1 \le i \le n$, and $g$ be as before and let $E[X^i]$, $1 \le i \le n$,
denote the mean of the i.i.d. random variables $X_k^i$, $1 \le i \le n$. Then

$$\hat{\lambda}^{\pi} = \lambda^{\pi} - \sum_{i=1}^{n} E[X^i]$$

for all policies $\pi$, where $\lambda^{\pi}$ is the long-run average cost in (7).

Proof. Using the state evolution equations (1)-(2),

$$\begin{aligned}
\hat{\lambda}^{\pi} &= \lim_{m \to \infty} E\left[\frac{1}{m} \sum_{i=1}^{n} \sum_{k=0}^{m-1} \left(q_k^i - g(T^i(s_k))\right)^+\right] \\
&= \lim_{m \to \infty} E\left[\frac{1}{m} \sum_{i=1}^{n} \sum_{k=0}^{m-1} \left(q_{k+1}^i - X_k^i\right)\right] \\
&= \sum_{i=1}^{n} \left( \lim_{m \to \infty} E\left[\frac{1}{m} \sum_{k=0}^{m-1} q_{k+1}^i\right] - E[X^i] \right) \\
&= \sum_{i=1}^{n} \left( \lim_{m \to \infty} E\left[\frac{1}{m} \sum_{k=0}^{m-1} q_k^i\right] + \lim_{m \to \infty} E\left[\frac{1}{m}\left(q_m^i - q_0^i\right)\right] - E[X^i] \right) \\
&= \lim_{m \to \infty} E\left[\frac{1}{m} \sum_{k=0}^{m-1} \sum_{i=1}^{n} q_k^i\right] - \sum_{i=1}^{n} E[X^i] \\
&= \lambda^{\pi} - \sum_{i=1}^{n} E[X^i].
\end{aligned}$$

The second last equality above follows from the fact that $\lim_{m \to \infty} E\left[\frac{1}{m}(q_m^i - q_0^i)\right] = 0$. The
claim follows. □
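The telescoping step behind Lemma 1 can be checked numerically. The sketch below simulates (1) and (2) under an arbitrary equal-split policy with i.i.d. arrivals and returns the finite-horizon averages. The policy, the arrival distributions, the conversion function with the factor 1/2 and the name `simulate_averages` are all assumptions made for illustration. For any finite m, the average queue length equals the average residual plus the average arrivals plus a boundary term (q_0 - q_m)/m, and the lemma corresponds to that boundary term vanishing in expectation as m grows.

```python
import math
import random


def simulate_averages(m=10_000, n=2, seed=0):
    """Return (avg sum of q_k, avg sum of (q_k - g(T_k))^+, avg arrivals, boundary term)."""
    rng = random.Random(seed)
    q = [0.0] * n
    q0_total = sum(q)
    energy = 0
    sum_q = sum_residual = sum_x = 0.0
    for _ in range(m):
        share = energy // n                      # equal integer split, total <= energy
        alloc = [share] * n
        x = [1 if rng.random() < 0.2 else 0 for _ in range(n)]   # i.i.d. data arrivals
        residual = [max(qi - 0.5 * math.log(1.0 + t), 0.0) for qi, t in zip(q, alloc)]
        sum_q += sum(q)
        sum_residual += sum(residual)
        sum_x += sum(x)
        q = [r + xi for r, xi in zip(residual, x)]               # equation (1)
        energy = energy - sum(alloc) + rng.randint(0, 4)         # equation (2)
    boundary = (q0_total - sum(q)) / m
    return sum_q / m, sum_residual / m, sum_x / m, boundary


avg_q, avg_residual, avg_x, boundary = simulate_averages()
```

Comparing `avg_q` with `avg_residual + avg_x + boundary` shows the identity holding up to floating-point error, independently of the particular policy used.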


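The linear relationship established in Lemma 1 between the average cost in (7) and the
quantity defined in (10) enables us to define the new single-stage cost function as:

$$c(s_k, T(s_k)) = \sum_{i=1}^{n} \left(q_k^i - g(T^i(s_k))\right)^+. \tag{11}$$

With this single-stage cost function, the long-run average cost of an SDP $\pi$ is given by

$$\lambda^{\pi} = \lim_{m \to \infty} E\left[\frac{1}{m} \sum_{k=0}^{m-1} c(s_k, T(s_k))\right]. \tag{12}$$

The single-stage cost $d(s_k)$ of an SRP $\psi$ is given by

$$d(s_k) = \sum_{a \in A(s_k)} \phi(s_k, a)\, c(s_k, a), \tag{13}$$

where $a$ gives the energy split. The long-run average cost of an SRP $\psi$ is

$$\lambda^{\psi} = \lim_{m \to \infty} E\left[\frac{1}{m} \sum_{k=0}^{m-1} d(s_k)\right]. \tag{14}$$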


It can be inferred from Lemma 1 that a policy which minimizes the average cost in (12)
(or (14)) will also minimize the average cost given by (7) (or (9)). In this paper we are
interested in finding stationary policies (deterministic or randomized) which optimally share
the energy among a set of nodes. Therefore our aim is to find policies which minimize the
average cost per step, when the single-stage cost is given by (11).

Any stationary optimal policy minimizes the average cost of the system over all policies.
Let $\pi^*$ be an optimal policy and $\Pi$ be the set of all policies. The average cost of policy $\pi^*$
is denoted $\lambda^*$. Then

$$\lambda^* = \inf_{\pi \in \Pi} \lambda^{\pi}.$$

The policy corresponding to the above average cost minimizes the sum of (data) queue lengths
of all nodes. By Little's law, under stationarity, the average sum of data queue lengths at
the sensor nodes is proportional to the average waiting time or delay of the arrivals (bits).
Hence an average cost optimal policy minimizes the stationary mean delay as well.
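To make the proportionality concrete, consider a sketch under the i.i.d. arrival setting of Remark 1, where the total arrival rate is $\sum_{i=1}^{n} E[X^i]$ bits per slot. Writing $\bar{W}^{\pi}$ for the stationary mean delay under policy $\pi$ (a symbol introduced here only for illustration), Little's law gives, up to the slot-accounting convention used for the queue lengths,

$$\bar{W}^{\pi} = \frac{\lim_{m \to \infty} E\left[\frac{1}{m} \sum_{k=0}^{m-1} \sum_{i=1}^{n} q_k^i\right]}{\sum_{i=1}^{n} E[X^i]}.$$

Since the denominator does not depend on the policy, a policy that minimizes the average total queue length also minimizes $\bar{W}^{\pi}$.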

The class of stationary deterministic policies is contained in the class of stationary
randomized policies and, in the system we consider, an optimal policy is known to exist in the
class of stationary deterministic policies. We provide an algorithm which finds an optimal
SDP. The algorithm is computationally efficient for small state and action spaces. However,
for large state-action spaces, its computations are expensive. To mitigate this
problem, we provide approximation algorithms which find near-optimal stationary policies
for the system. These algorithms are described in the following sections.


4 Energy Sharing Algorithms 

4.1 Background 

Consider an optimal SDP $\pi^*$ for the energy sharing MDP. Then $\lambda^*$ corresponds to the average cost of the policy $\pi^*$. Suppose $i_r$ is a reference state in the MDP. For any state $i \in S$, let $h^*(i)$ be the relative (or the differential) cost defined as the minimum of the difference between the expected cost to reach state $i_r$ from $i$ and the expected cost incurred if the cost per stage was $\lambda^*$. The quantities $\lambda^*$ and $h^*(i)$, $i \in S$, satisfy the Bellman equation:

$$\lambda^* + h^*(i) = \min_{a \in A(i)} \left( c(i,a) + \sum_{j \in S} p(i,a,j)\, h^*(j) \right), \quad \forall i \in S, \tag{15}$$
where $p(i,a,j)$ is the probability that the system will move from state $i$ to state $j$ under action $a$. We denote by $Q^*(i,a)$ the optimal differential cost of any feasible state-action tuple $(i,a)$ as follows:

$$Q^*(i,a) = c(i,a) + \sum_{j \in S} p(i,a,j)\, h^*(j). \tag{16}$$


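Equation (15) can now be rewritten as

$$\lambda^* + h^*(i) = \min_{a \in A(i)} Q^*(i,a), \quad \forall i \in S, \tag{17}$$

or alternately

$$h^*(i) = \min_{a \in A(i)} Q^*(i,a) - \lambda^*, \quad \forall i \in S. \tag{18}$$

Plugging (18) into (16), one obtains

$$Q^*(i,a) = c(i,a) + \sum_{j \in S} p(i,a,j) \left( \min_{b \in A(j)} Q^*(j,b) - \lambda^* \right), \tag{19}$$

or

$$\lambda^* + Q^*(i,a) = c(i,a) + \sum_{j \in S} p(i,a,j) \min_{b \in A(j)} Q^*(j,b), \quad \forall i \in S, \ \forall a \in A(i). \tag{20}$$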


Equation (20) is also referred to as the Q-Bellman equation. The important thing to note is that whereas the Bellman equation (15) is not directly amenable to stochastic approximation, the Q-Bellman equation (20) is, because the minimization operation in (20) is inside the conditional expectation, unlike in (15) (where it is outside of it). If the transition probabilities and the cost structure of the system model are known, then (20) can be solved using dynamic programming techniques. When the system model is not known (as in the problem we study), the Q-learning algorithm can be used to obtain optimal policies. This learning algorithm solves (20) in an online manner using simulation to obtain an optimal policy. It is described in the following subsection.


4.2 Relative Value Iteration based Q-Learning 

Q-learning is a stochastic iterative, simulation-based algorithm that aims to find the $Q^*(i,a)$ values for all feasible state-action pairs $(i,a)$. It is a model-free learning algorithm and proceeds by assuming that the transition probabilities $p(i,a,j)$ are unknown. Initially, Q-values for all state-action pairs are set to zero, i.e., $Q_0(i,a) = 0$, $\forall i \in S$, $a \in A(i)$. Then, $\forall k \geq 0$, the Q-learning update for a state-action pair visited during simulation is carried out as follows:

$$Q_{k+1}(i,a) = (1 - \alpha(k))\, Q_k(i,a) + \alpha(k) \left( c(i,a) + \min_{b \in A(j)} Q_k(j,b) - \min_{u \in A(i_r)} Q_k(i_r,u) \right), \tag{21}$$

where $i$ is the current state at decision time $k$ and $i_r$ is the reference state. The action in state $i$ is selected using one of the exploration mechanisms described below. State $j$ corresponds to the 'next' state that is obtained from simulation when the action $a$ is selected in state $i$. Also, $\alpha(k)$, $k \geq 0$, is a given step-size sequence such that $\alpha(k) > 0$, $\forall k \geq 0$, and satisfies the following conditions:

$$\sum_{k} \alpha(k) = \infty \quad \text{and} \quad \sum_{k} \alpha^2(k) < \infty.$$

Let $t(k) = \sum_{i=0}^{k-1} \alpha(i)$, $k \geq 1$, with $t(0) = 0$. Then $t(k)$, $k \geq 0$, corresponds to the "timescale" of the algorithm's updates. The first condition above ensures that $t(k) \to \infty$ as $k \to \infty$. This ensures that the algorithm does not converge prematurely. The second condition makes sure that the noise asymptotically vanishes. These conditions on step sizes guarantee the convergence of Q-learning to the optimal state-action value function; see [1] for a proof of convergence of the algorithm. The update (21) is carried out for the state-action pairs visited during simulation. The exploration mechanisms we employ are as follows:


1. $\epsilon$-greedy: In the energy sharing problem, the number of actions feasible in every state is finite. Hence there exists an action $a_m$ for state $i$ such that $Q_k(i, a_m) \leq Q_k(i, a)$, $\forall a \in A(i)$, $\forall k \geq 0$. We choose $\epsilon \in (0,1)$. In state $i$, action $a_m$ is picked with probability $1 - \epsilon$, while any other action is picked with probability $\epsilon$.


2. UCB Exploration: Let $N_i(k)$ be the number of times state $i$ is visited until time $k$. Similarly, let $N_{i,a}(k)$ be the number of times action $a$ is picked in state $i$ up to time $k$. The Q-value of state-action pair $(i,a)$ at time $k$ is $Q_k(i,a)$. When the state $i$ is encountered at time $k$, the action for this state is picked according to the following rule:

$$a' = \arg\max_{a \in A(i)} \left( -Q_k(i,a) + \beta \sqrt{\frac{\ln N_i(k)}{N_{i,a}(k)}} \right), \tag{22}$$

where $\beta$ is a constant. The first term on the right hand side gives preference to an action that has yielded good performance in the past visits to state $i$, while the second term gives preference to actions that have not been tried out many times so far, relative to $\ln N_i(k)$.
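The two exploration mechanisms can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the "other" action in the $\epsilon$-greedy case is drawn uniformly (the text does not say how it is chosen), and untried actions are selected first in the UCB case so that the count $N_{i,a}(k)$ is never zero inside the square root. Both conventions are assumptions.

```python
import math
import random

def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick the minimum-Q action w.p. 1 - epsilon, otherwise another action.

    q_row[a] is the current Q-value of action a in the visited state.
    Assumption: the non-greedy action is drawn uniformly at random.
    """
    best = min(range(len(q_row)), key=lambda a: q_row[a])
    others = [a for a in range(len(q_row)) if a != best]
    if others and rng.random() < epsilon:
        return rng.choice(others)
    return best

def ucb_action(q_row, state_visits, action_visits, beta):
    """UCB rule: maximize -Q + beta * sqrt(ln N_i / N_{i,a}).

    Assumption: any action with zero visits is tried first.
    """
    for a, n in enumerate(action_visits):
        if n == 0:
            return a
    log_n = math.log(max(state_visits, 1))
    scores = [-q + beta * math.sqrt(log_n / n)
              for q, n in zip(q_row, action_visits)]
    return max(range(len(scores)), key=lambda a: scores[a])
```

Since costs are minimized, the Q-value enters the UCB score with a negative sign, so a low-cost action and a rarely tried action both receive a high score.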


Remark 4. The convergence rates for the discounted Q-learning have been studied in 

The finite-time bounds to reach an e-optimal policy by following the Q-learning rule 
are given in In the Q-learning algorithm, to explore the value of different states 

and actions, one needs to visit each state-action pair infinitely often. However, in practice, 
depending on the size of the state-action space, we need to simulate the Q-learning algorithm 
so that each state-action pair is visited a sufficient number of times. In our experiments for 
the case of two sensor nodes, the size of the state space is of the order of 10^ and we ran our 
algorithm for 10® iterations. 


Once we determine $Q^*(i,a)$ for all state-action pairs, we can obtain the optimal action for a state $i$ by choosing the action that minimizes $Q^*(i,a)$. So

$$a^* = \arg\min_{a \in A(i)} Q^*(i,a). \tag{23}$$

It should be noted that the Q-learning algorithm does not need knowledge of the cost structure and transition probabilities, and it learns an optimal policy by interacting with the system.
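As an illustration of the relative value iteration based Q-learning update, the following Python sketch runs it on a toy single-node system. Everything about the toy model is an assumption made for the example and not the setup used in the experiments: the buffer sizes, the linear conversion `g(t) = t`, the uniform arrival distributions, the choice of reference state `(0, 0)`, and the per-pair step size $1/n$ (where $n$ counts visits to the pair), which satisfies the two step-size conditions.

```python
import random

D_MAX, E_MAX = 3, 3              # toy buffer sizes (assumed)

def g(t):
    return t                     # assumed: one energy unit sends one data bit

def step(state, t, rng):
    """Simulate one stage and return (cost, next_state)."""
    q, e = state
    cost = max(q - g(t), 0)      # single-stage cost (q - g(T))^+
    x = rng.randint(0, 2)        # assumed data arrivals
    y = rng.randint(0, 1)        # assumed energy arrivals
    return cost, (min(cost + x, D_MAX), min(e - t + y, E_MAX))

def rvi_q_learning(n_iters=20000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    states = [(q, e) for q in range(D_MAX + 1) for e in range(E_MAX + 1)]
    # Feasible actions in state (q, e): allocate t = 0, ..., e energy units.
    Q = {(s, t): 0.0 for s in states for t in range(s[1] + 1)}
    visits = {key: 0 for key in Q}
    ref = (0, 0)                 # reference state (assumed choice)
    state = (0, E_MAX)
    for _ in range(n_iters):
        actions = list(range(state[1] + 1))
        greedy = min(actions, key=lambda t: Q[(state, t)])
        others = [t for t in actions if t != greedy]
        explore = others and rng.random() < epsilon
        t = rng.choice(others) if explore else greedy
        cost, nxt = step(state, t, rng)
        visits[(state, t)] += 1
        alpha = 1.0 / visits[(state, t)]
        min_next = min(Q[(nxt, b)] for b in range(nxt[1] + 1))
        min_ref = min(Q[(ref, u)] for u in range(ref[1] + 1))
        Q[(state, t)] = ((1 - alpha) * Q[(state, t)]
                         + alpha * (cost + min_next - min_ref))
        state = nxt
    policy = {s: min(range(s[1] + 1), key=lambda t: Q[(s, t)]) for s in states}
    return Q, policy
```

Calling `rvi_q_learning()` returns the learned table and the greedy policy read off from it, i.e., the minimizing action in each state.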


5 Approximation Algorithms 


The learning algorithm described in Section 4.2 is an iterative stochastic algorithm that learns the optimal energy split. This method requires that the $Q(s,a)$ values be stored for all $(s,a)$ tuples. The values of $Q(s,a)$ for each $(s,a)$ tuple are updated in (21) over a number of iterations using adequate exploration. These updates play a key role in finding the optimal control for a given state. Nevertheless, for large state-action spaces these computations are expensive as every lookup operation and update requires memory access. For example, if there are two nodes sharing energy and the buffer sizes are $D_{\max} = E_{\max} = 30$, then the

number of (s, a) tuples would be of the order 10®, which demands enormous amount of 
computation time and memory space. This condition is exacerbated when the number of 
nodes that share energy increases. For instance, in the case of four nodes sharing energy 
with $D_{\max} = E_{\max} = 30$, we have $|S \times A| \approx 30^9$. Thus, we have a scenario where the

state-action space can be extremely large. 

To mitigate this problem, we propose two algorithms that are both based on certain 
threshold features. Both algorithms tackle the curse of dimensionality, by reducing the 
computational complexity. We describe below our threshold based features, following which 
we describe our algorithms. 


5.1 Threshold based Features 


The fundamental idea of threshold based features is to cluster states in a particular manner, 
based on the properties of the differential value functions. The following proposition proves 
the monotonicity property of the differential value functions for the scenario where there is 
a single node and an EH source. This simple scenario is considered for the sake of clarity in 
the proof. 

Proposition 1. Let $H^*(q,E)$ be the differential value of state $(q,E)$. Let $q < q_L \leq D_{\max}$ and $E_{\max} \geq E_L > E$, respectively. Then,

$$H^*(q,E) \leq H^*(q_L,E), \tag{24}$$

$$H^*(q,E) \geq H^*(q,E_L). \tag{25}$$


Proof. Let $J(s)$ be the total cost incurred when starting from state $s$. Define the Bellman operator $L : \mathbb{R}^n \to \mathbb{R}^n$ as

$$(LJ)(s) = \min_{T \in A(s)} \left( c(s,T) + E[J(s')] \right),$$

where $s'$ corresponds to the next state after $s$ and $T$ corresponds to the action taken in state $s$. As noted in Section 5.1, we show the proof for a single node and EH source. The proof can be easily generalized to multiple nodes. Thus the state $s$ corresponds to the tuple $(q,E)$. Hence the above equation can be rewritten as

$$(LJ)(q,E) = \min_{T \in A(s)} \left( c(q,E,T) + E[J(q',E')] \right), \quad \forall (q,E) \in S.$$


We consider the application of the operator $L$ on the differential cost function $H(\cdot)$. We set out to prove this proposition using the relative value iteration scheme (see [32]). For this, we set a reference state $r = (q_r, E_r) \in S$. The cost function in our case is $(q - g(T))^+$. Initially the differential value function has value zero for all states $(q,E) \in S$, i.e., $H(q,E) = 0$, $\forall (q,E) \in S$. Then for some arbitrary $(q,E) \in S$ we have

$$\begin{aligned} LH(q,E) &= \min_{T \in A(q,E)} \left( (q - g(T))^+ + E[H(q',E')] \right) - LH(q_r,E_r) \\ &= \min_{T \in A(q,E)} (q - g(T))^+ - LH(q_r,E_r), \end{aligned}$$

since $H(q',E') = 0$, $\forall (q',E') \in S$. Let $T_m$ be the value of $T$ achieving the minimum in the first term of the RHS. Then

$$LH(q,E) = (q - g(T_m))^+ - LH(q_r,E_r).$$

Now consider the differential value of state $(q_L, E)$ where $q_L > q$. Thus, consider

$$\begin{aligned} LH(q_L,E) &= \min_{T \in A(q_L,E)} \left( (q_L - g(T))^+ + E[H(q',E')] \right) - LH(q_r,E_r) \\ &= \min_{T \in A(q_L,E)} (q_L - g(T))^+ - LH(q_r,E_r) \\ &= (q_L - g(T_L))^+ - LH(q_r,E_r), \end{aligned}$$

where $T_L$ is the value of $T$ for which the minimum of the expression $(q_L - g(T))^+$ in the above equations is achieved. We have

$$\begin{aligned} LH(q,E) &= (q - g(T_m))^+ - LH(q_r,E_r) \\ &\leq (q - g(T_L))^+ - LH(q_r,E_r) \\ &\leq (q_L - g(T_L))^+ - LH(q_r,E_r) \\ &= LH(q_L,E). \end{aligned} \tag{26}$$


We have $H(q_L,E) \geq H(q,E)$ since these values are initialized to zero and, from (26), $LH(q_L,E) \geq LH(q,E)$. Now consider the differential value function of the state $(q,E_L)$ where $E_L > E$:

$$\begin{aligned} LH(q,E_L) &= \min_{T \in A(q,E_L)} \left( (q - g(T))^+ + E[H(q',E')] \right) - LH(q_r,E_r) \\ &= \min_{T \in A(q,E_L)} (q - g(T))^+ - LH(q_r,E_r) \\ &= (q - g(T_E))^+ - LH(q_r,E_r), \end{aligned}$$

where $T_E$ is the value of $T$ for which the minimum of the expression $(q - g(T))^+$ in the above equations is achieved. We have

$$\begin{aligned} LH(q,E_L) &= (q - g(T_E))^+ - LH(q_r,E_r) \\ &\leq (q - g(T_m))^+ - LH(q_r,E_r) \\ &= LH(q,E). \end{aligned} \tag{27}$$

Since $H(q,E_L)$ and $H(q,E)$ are initialized to zero, we have $H(q,E_L) \leq H(q,E)$ and, from (27), $LH(q,E_L) \leq LH(q,E)$. We prove the following statements using mathematical induction:

$$L^k H(q,E) \leq L^k H(q_L,E) \quad \forall k \geq 0,$$

$$L^k H(q,E) \geq L^k H(q,E_L) \quad \forall k \geq 0.$$


We have seen above that the two statements are true for both $k = 0$ and $k = 1$. Let us consider the first statement and assume that it holds for some $k$. We then prove that it holds for $k + 1$. Consider

$$L^{k+1}H(q,E) = \min_{T \in A(q,E)} \left( (q - g(T))^+ + E[L^k H(q',E')] \right) - L^k H(q_r,E_r).$$

Assume $T_m$ is the value of $T$ at which the minimum of $\left( (q - g(T))^+ + E[L^k H(q',E')] \right)$ is attained. Then,

$$L^{k+1}H(q,E) = (q - g(T_m))^+ + E[L^k H(q - g(T_m) + x, E - T_m + y)] - L^k H(q_r,E_r),$$

where $x$, $y$ are obtained from independent random distributions. Similarly, we get

$$L^{k+1}H(q_L,E) = (q_L - g(T_L))^+ + E[L^k H(q_L - g(T_L) + x, E - T_L + y)] - L^k H(q_r,E_r),$$

where $T_L$ is the value of $T$ for which the minimum in the expression $\left( (q_L - g(T))^+ + E[L^k H(q',E')] \right)$ is achieved. We have

$$\begin{aligned} L^{k+1}H(q,E) &= (q - g(T_m))^+ + E[L^k H(q - g(T_m) + x, E - T_m + y)] - L^k H(q_r,E_r) \\ &\leq (q - g(T_L))^+ + E[L^k H(q - g(T_L) + x, E - T_L + y)] - L^k H(q_r,E_r) \\ &\leq (q_L - g(T_L))^+ + E[L^k H(q_L - g(T_L) + x, E - T_L + y)] - L^k H(q_r,E_r), \end{aligned}$$

since the property holds true for $L^k H$, i.e., $L^k H(q,E) \leq L^k H(q_L,E)$. Thus,

$$\begin{aligned} L^{k+1}H(q,E) &\leq (q_L - g(T_L))^+ + E[L^k H(q_L - g(T_L) + x, E - T_L + y)] - L^k H(q_r,E_r) \\ &= L^{k+1}H(q_L,E). \end{aligned}$$

Hence,

$$L^k H(q,E) \leq L^k H(q_L,E) \quad \forall k \geq 0. \tag{28}$$
Similarly we get,

$$\begin{aligned} L^{k+1}H(q,E_L) &= (q - g(T_E))^+ + E[L^k H(q - g(T_E) + x, E_L - T_E + y)] - L^k H(q_r,E_r) \\ &\leq (q - g(T_m))^+ + E[L^k H(q - g(T_m) + x, E_L - T_m + y)] - L^k H(q_r,E_r) \\ &\leq (q - g(T_m))^+ + E[L^k H(q - g(T_m) + x, E - T_m + y)] - L^k H(q_r,E_r) \\ &= L^{k+1}H(q,E), \end{aligned}$$

hence by mathematical induction on $k$ we get,

$$L^k H(q,E_L) \leq L^k H(q,E) \quad \forall k \geq 0. \tag{29}$$

As a consequence of the relative value iteration scheme ([32]), when $k \to \infty$, $L^k H \to H^*$ with $H^*(q_r,E_r) = \lambda^*$. Thus, from (28) and (29), as $k \to \infty$, we obtain

$$H^*(q,E) \leq H^*(q_L,E),$$

$$H^*(q,E) \geq H^*(q,E_L).$$

The claim now follows. □


Proposition 1 can be easily generalized to multiple nodes in the following manner. Suppose there are $n$ nodes and one EH source. Let $s = (q^1, \ldots, q^j, \ldots, q^n, E)$ and $s' = (q^1, \ldots, q_L^j, \ldots, q^n, E)$, where $q_L^j > q^j$. The states $s$ and $s'$ differ only in the data buffer queue lengths of node $j$, while the data buffer queue lengths of other nodes remain the same and so does the energy buffer level. Then it can be observed that $H^*(q^1, \ldots, q^j, \ldots, q^n, E) \leq H^*(q^1, \ldots, q_L^j, \ldots, q^n, E)$. In a similar manner, let state $s'' = (q^1, q^2, \ldots, q^n, E_L)$ with $E_L > E$. Then states $s$ and $s''$ differ only in the energy buffer levels. Hence $H^*(q^1, \ldots, q^n, E) \geq H^*(q^1, \ldots, q^n, E_L)$. This proposition provides us a method which is useful for clustering states.

Remark 5. The monotonicity property of the differential value function $H^*$ provides a justification to group nearby states to form an aggregate state. The value function of the aggregated
state will be the average of the value function of the states in a partition. If the difference 
between values of states in a cluster is not much, the value function of aggregated state will 
be close to the value function of the unaggregated state. Thus, the policy obtained from the 
aggregated value function is likely to be close to the policy obtained from unaggregated states. 
Without the monotonicity property, states may be grouped arbitrarily and consequently, state 
aggregation may not yield a good policy. 

Remark 6. In the case of MDP with large state-action space, one goes for function ap¬ 
proximation based methods (see Chapter 8 in Ell However, if one combines Q-learning 
with function approximation, we do not have convergence guarantees to the optimal policy 
unlike Q-learning without function approximation (Q-learning with tabular representation 
HI However, when Q-learning is combined with state-aggregation (QL-SA) we continue 
to have convergence guarantees (see Section 6.7 in W- Q-learning using state aggregation 
can produce good policies only when the value function has a monotonicity structure, which is established in Proposition 1.
5.1.1 Clustering

The data and energy buffers are quantized and using this we formulate the aggregate state-action space. The quantization of buffer space is described next. We predefine data buffer and energy buffer partitions (or quantization levels) $d_1, d_2, \ldots, d_s$ and $e_1, e_2, \ldots, e_r$, respectively. The partition (or quantization level) $d_i$, $i \in \{1, \ldots, s\}$, corresponds to a given range $(x_L^i, x_U^i)$ and is fixed, where $x_L^i$ and $x_U^i$ represent the prescribed lower and upper data buffer level limits. In a similar manner, the quantization level $e_j$, $j \in \{1, \ldots, r\}$ (or energy buffer partition), corresponds to a given interval $(y_L^j, y_U^j)$, where $y_L^j$ and $y_U^j$ represent the prescribed lower and upper energy buffer level limits. As an example, suppose $D_{\max} = E_{\max} = 10$ and each of the buffers is quantized into three levels, i.e., $s = r = 3$. An instance of data and energy buffer partition ranges in this scenario can be $x_L^1 = y_L^1 = 0$, $x_U^1 = y_U^1 = 3$, $x_L^2 = y_L^2 = 4$, $x_U^2 = y_U^2 = 7$, $x_L^3 = y_L^3 = 8$, $x_U^3 = y_U^3 = 10$. Here Partition 1 corresponds to the number of data (energy) bits (units) in the range (0, 3), while Partition 3 corresponds to the number of data (energy) bits (units) in the range (8, 10). The following inequalities hold with respect to the partition limits:

$$0 = x_L^1 < x_U^1 < x_L^2 < \cdots < x_L^s < x_U^s = D_{\max} \quad \text{and} \quad x_U^i + 1 = x_L^{i+1}, \quad 1 \leq i \leq s - 1.$$

Similarly,

$$0 = y_L^1 < y_U^1 < y_L^2 < \cdots < y_L^r < y_U^r = E_{\max} \quad \text{and} \quad y_U^i + 1 = y_L^{i+1}, \quad 1 \leq i \leq r - 1.$$

5.1.2 Aggregate States and Actions 

We define an aggregate state as $s' = (l^1, \ldots, l^{n+1})$, where for $1 \leq i \leq n$, $l^i$ is the data buffer level for the $i$-th node and $l^{n+1}$ is the energy buffer level. So $l^i \in \{1, \ldots, s\}$, $1 \leq i \leq n$, and $l^{n+1} \in \{1, \ldots, r\}$. An aggregate action corresponding to the state $s'$ is an $n$-tuple $t'$ of the form $t' = (t^1, \ldots, t^n)$, where $t^i \in \{1, \ldots, r\}$, $1 \leq i \leq n$. Each component in $t'$ indicates an energy level. By considering the data level in all the nodes, the controller decides on an energy level for each node. Thus the energy level indicates the energy partition which can be supplied to the node. For instance, if $D_{\max} = E_{\max} = 15$, $s = r = 3$ and there are two nodes in the system, then an example aggregate state is $s' = (1, 1, 3)$. Suppose the controller selects the aggregate action $t' = (2, 1)$, which means that the controller decides to give $u$ energy bits to Node 1 and $v$ energy bits to Node 2, with $y_L^2 \leq u \leq y_U^2$ and $y_L^1 \leq v \leq y_U^1$, respectively.
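The mapping from exact buffer levels to an aggregate state can be sketched as follows. The partition limits are those of the three-level example in Section 5.1.1 (ranges 0 to 3, 4 to 7 and 8 to 10 for buffers of size 10); the function names and the use of a sorted list of lower limits with binary search are illustrative choices, not part of the method as stated.

```python
from bisect import bisect_right

# Lower limits of the three partitions in the example of Section 5.1.1.
DATA_LOWER = [0, 4, 8]
ENERGY_LOWER = [0, 4, 8]

def partition_index(level, lower_limits):
    """Return the 1-based partition whose range contains `level`."""
    return bisect_right(lower_limits, level)

def aggregate_state(data_levels, energy_level):
    """Map exact levels (q^1, ..., q^n, E) to (l^1, ..., l^n, l^{n+1})."""
    data_part = tuple(partition_index(q, DATA_LOWER) for q in data_levels)
    return data_part + (partition_index(energy_level, ENERGY_LOWER),)
```

For two nodes holding 2 and 5 data bits and an energy buffer holding 9 units, `aggregate_state([2, 5], 9)` gives the aggregate state `(1, 2, 3)`.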

5.1.3 Cardinality Reduction 

Note that $s \ll D_{\max}$ and $r \ll E_{\max}$. Let the aggregated state and action spaces be denoted by $S'$ and $A'$ respectively. The aggregated state-action space has cardinality $|S' \times A'|$. Thus, the cardinality of the state-action space is reduced to a great extent by aggregation. For instance, in the case of four nodes sharing energy from one EH source and $E_{\max} = D_{\max} = 30$, the cardinality of the state-action space without state-aggregation is $|S \times A| \approx 30^9$. However, with four partitions each for the data and energy buffers, the cardinality of the state-action space with aggregation is $|S' \times A'| \approx 4^9$.
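The size of the reduction can be checked with a rough count. The sketch below assumes one coordinate per data buffer, one coordinate for the energy buffer and one action component per node, and it ignores the feasibility constraint that the allocated energy cannot exceed the stored energy, so it gives an upper bound of the same order as the figures quoted above. The function name is hypothetical.

```python
def state_action_cardinality(n_nodes, data_levels, energy_levels):
    """Rough upper bound on the number of state-action pairs."""
    n_states = data_levels ** n_nodes * energy_levels   # (q^1..q^n, E)
    n_actions = energy_levels ** n_nodes                # one component per node
    return n_states * n_actions

full = state_action_cardinality(4, 30, 30)        # unaggregated, four nodes
aggregated = state_action_cardinality(4, 4, 4)    # four partitions per buffer
```

Under these assumptions `full` is $30^9 \approx 1.97 \times 10^{13}$ and `aggregated` is $4^9 = 262144$.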


5.2 Approximate Learning Algorithm 


We now explain our approximate learning algorithm for the energy sharing problem. It is based on Q-learning and state aggregation. Although the straightforward Q-learning algorithm described in Section 4.2 requires complete state information and is not computationally efficient with respect to large state-action spaces, its state-aggregation based counterpart requires significantly less computation and memory space. Also, our experiments show that we do not compromise much on the policy obtained either (see Fig. 9b).


5.2.1 Method 

Let $s' = (l_k^1, \ldots, l_k^{n+1})$ be the aggregate state at decision time $k$. The action taken in $s'$ is $t' = (t_k^1, \ldots, t_k^n)$. The Q-value $Q(s',t')$ indicates how good an aggregate state-action tuple is. The algorithm proceeds with the following update rule:

$$Q_{k+1}(s',t') = (1 - \alpha(k))\, Q_k(s',t') + \alpha(k) \left( c(s',t') + \min_{b \in A'(j')} Q_k(j',b) - \min_{u \in A'(r')} Q_k(r',u) \right), \tag{30}$$

where $j'$ is the aggregate state obtained by simulating action $t'$ in state $s'$. Also, $r'$ is a reference state and $\alpha(k)$, $k \geq 0$, is a positive step-size schedule satisfying the conditions mentioned in Section 4.2. To facilitate exploration, we employ the mechanisms described in


Section 4.2 Convergence of Q-learning with state aggregation is discussed in Section 6.7 of 


Remark 7. The aggregate state in every step of the iteration (30) is computed by knowing the amount of data present in each sensor node. A viable implementation would just need a mapping of the buffer levels to these partitions, using which the controller can compute the aggregate state for any combination of buffer levels. Since this method requires storing the Q-value of each aggregate state-action pair and $|S' \times A'| \ll |S \times A|$, the number of Q-values stored is much less compared to the unaggregated Q-learning algorithm. The computational complexity of the method described above is dependent on the size of the aggregate state-action space and the number of iterations required to converge to an optimal policy (w.r.t. the aggregate state-action space). For instance, in the case of four sensor nodes, the size of the state-action space grows to $\approx 30^9$ with the data and energy buffer sizes being 30 each. The
number of iterations that the above method reguires to find a near-optimal policy is 10® with 
six partitions of the buffer size as compared to Q-learning without state aggregation (Section 
4 . 2 ) which reguires at least 10^^ iterations. 


Remark 8. It must be observed that using (30), the controller decides the partition and not the number of energy bits to be distributed, i.e., it finds an optimal aggregate action for every aggregate state. It follows from this that, in order to find the aggregate action for an aggregate state, the knowledge of the exact buffer levels is not required (since this is based on the Q-values of aggregate state-action pairs). In this manner (30) is beneficial. The optimal
policy obtained using (30) would indicate only the energy levels. An added advantage of the 
above approximation algorithm is that the cost structure discussed in Section holds good 
here as well. 


5.2.2 Energy distribution 


Note that once an aggregate action is chosen for a state, the energy division is random, adhering to the action levels chosen. For instance, let us assume that there are two sensor nodes in the system. Data and energy buffers have three partitions each and thus $s = 3$, $r = 3$. Here $y_L^1 = 0$ and $y_U^3 = E_{\max}$. Suppose the number of energy bits in the energy buffer is $z$ and those bits belong to partition 3. Let the number of data bits at nodes 1 and 2 be $x$ and $y$, respectively. Here $x$ and $y$ belong to partition 2. Hence the aggregate state is $(2, 2, 3)$. The controller decides on the aggregate action $(1, 2)$. Thus $y_L^1$ bits of energy are provided to Node 1, while Node 2 is given $y_L^2$ bits of energy. The remaining number of bits in the buffer will be $r = z - (y_L^1 + y_L^2)$. In order to distribute these bits, the proportions of data $p_1 = \frac{x}{x+y}$ and $1 - p_1 = \frac{y}{x+y}$ are computed. Each of the $r$ bits is supplied to Node 1 with probability $p_1$ and to Node 2 with probability $1 - p_1$. If $u$ and $v$ represent the total number of energy bits provided to Nodes 1 and 2 respectively, then $u \leq y_U^1$, $v \leq y_U^2$ and $(u - y_L^1) + (v - y_L^2) \leq r$. It must be observed that even though an aggregate action chosen requires knowledge of only the aggregate state, the random distribution of energy (after a control is selected using (30)) is achieved by knowing the exact buffer levels.
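The random energy division can be sketched as below. The sketch assumes that every node first receives the lower limit of its chosen energy partition, that each remaining bit is then offered to a node drawn with probability proportional to its queued data, and that a bit offered to a node already at the upper limit of its partition stays in the buffer. The last rule is one reading of the text (it keeps the total within the bound stated above), and the function and argument names are hypothetical.

```python
import random

def distribute_energy(z, lower, upper, data, rng=random):
    """Split z stored energy bits among the nodes.

    lower[i], upper[i]: limits of the energy partition chosen for node i.
    data[i]: exact number of data bits queued at node i.
    """
    alloc = list(lower)                      # each node gets its lower limit
    remaining = z - sum(lower)
    total = sum(data)
    if remaining <= 0 or total == 0:
        return alloc
    weights = [d / total for d in data]      # e.g. p_1 = x / (x + y)
    for _ in range(remaining):
        i = rng.choices(range(len(data)), weights=weights)[0]
        if alloc[i] < upper[i]:              # stay inside the chosen partition
            alloc[i] += 1
    return alloc
```

With partitions (0, 3) and (4, 7) chosen for the two nodes and 10 stored bits, `distribute_energy(10, [0, 4], [3, 7], [5, 6])` returns an allocation inside those ranges whose total does not exceed 10.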


Remark 9. An advantage of using state-aggregation with Q-learning is that it has conver¬ 
gence guarantees (Chapter 6, Section 6.2 This overcomes the problem of basis selection 
for function approximation in the case of large state-action spaces. We have tried different 
partitioning schemes manually and all the schemes resulted in close policy performance. Also,
we observed that increasing the number of partitions improves the policy performance (see 
Fig. 6 in Section^ . 


5.3 Cross Entropy using State Aggregation and Policy Parameterization

The cross-entropy method is an iterative approach ([38]) that we apply to find near-optimal stationary randomized policies for the energy sharing problem. The algorithm searches for a policy in the space of all stationary randomized policies in a systematic manner. We define a class of randomized stationary policies $\{\pi^{\theta}, \theta \in \mathbb{R}^M\}$, parameterized by a vector $\theta$. For each pair $(s,a) \in S \times A$, $\pi^{\theta}(s,a)$ denotes the probability of taking action $a$ when the state $s$ is encountered under the policy corresponding to $\theta$. In order to follow the cross entropy approach and obtain the optimal $\theta^* \in \mathbb{R}^M$, we treat each component $\theta_i$, $i \in \{1, 2, \ldots, M\}$, of $\theta$ as a normal random variable with mean $\mu_i$ and variance $\sigma_i^2$. We will refer to these two quantities (the parameters of the normal distribution) as meta-parameters. We will tune the meta-parameters using the cross-entropy update rule (32) to find the best values of $\mu_i$ and $\sigma_i$, which will correspond to a mean of $\theta_i^*$ and a variance of zero. The cross entropy method works as follows: Multiple samples of $\theta$, namely $\theta^1, \theta^2, \ldots, \theta^N$, are generated according to the normal distribution with the current estimate of the meta-parameters. Each sampled $\theta$ will then correspond to a stationary randomized policy. We compute the average cost $\lambda(\theta)$ of an SRP determined by a sample $\theta$ by running a simulation trajectory with the policy parameter fixed at the sample $\theta$. We perform this average cost computation for all the sampled $\theta^i$, $i \in \{1, 2, \ldots, N\}$, i.e., we compute $\lambda(\theta^1), \lambda(\theta^2), \ldots, \lambda(\theta^N)$. We then update the current estimates of the meta-parameters based on only those sampled $\theta$'s (policies) whose average cost is lower than a threshold level (see (32)).


Remark 10. The Cross Entropy method is an adaptive importance sampling /I technique. 
The specific distribution from which the parameter 9 is sampled is known as the importance 
sampling distribution. The Gaussian distribution used as the importance sampling distri¬ 


bution yields analytical updation formulas (32) for the mean and variance parameters (see 
m)- For this reason, it is convenient to use the Gaussian vectors for the policy parameters. 


5.3.1 Policy Parameterization 

Let $\lambda(\theta)$ be the average cost of the system when parameterized by $\theta = (\theta_1, \ldots, \theta_M)^\top$. An optimal policy $\theta^*$ minimizes the average cost over all parameterizations. That is,

$$\theta^* = \arg\min_{\theta \in \mathbb{R}^M} \lambda(\theta).$$

An example of parameterized randomized policies, which we use for the experiments (involving state aggregation) in this paper, are the parameterized Boltzmann policies having the following form:

$$\pi^{\theta}(s,a) = \frac{e^{\theta^\top \phi_{sa}}}{\sum_{b \in A'(s)} e^{\theta^\top \phi_{sb}}}, \quad \forall s \in S', \ \forall a \in A'(s), \tag{31}$$

where $\phi_{sa}$ is an $M$-dimensional feature vector for the aggregated state-action tuple $(s,a)$ and $\phi_{sa} \in \mathbb{R}^M$. The parameterized Boltzmann policies are often used in approximation
techniques ([H 0 m EZl [38]) which deal with randomized policies. 
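A Boltzmann policy of this kind can be evaluated with a few lines of Python. The sketch assumes the usual softmax form, with the probability of action $a$ in state $s$ proportional to $\exp(\theta^\top \phi_{sa})$; the function name and the list-of-lists layout for the feature vectors are illustrative, and the subtraction of the largest score is a standard numerical safeguard that does not change the probabilities.

```python
import math

def boltzmann_policy(theta, features):
    """Return action probabilities for one aggregate state.

    features[a] is the feature vector phi_{sa} of feasible action a.
    """
    scores = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    shift = max(scores)                      # avoids overflow in exp
    weights = [math.exp(sc - shift) for sc in scores]
    total = sum(weights)
    return [w / total for w in weights]
```

Every feasible action receives a strictly positive probability, and with $\theta = 0$ the policy is uniform over the feasible actions.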

Remark 11. The probability distribution over actions is parameterized by 0 in the cross 
entropy method. Since actions in every state need to be explored, the distribution needs 
to assign a non-zero probability for every action feasible in a state. Hence the probability 
distribution must be chosen based on these requirements. The Boltzmann distribution for 
action selection fits these requirements and is a frequently used distribution in the literature 
(see \3E^) on policy learning and approximation algorithms. 

As noted in the beginning of this subsection, the parameters $\theta_1, \ldots, \theta_M$ are samples from the distributions $N(\mu_i, \sigma_i^2)$, $1 \leq i \leq M$, i.e., $\theta_i \sim N(\mu_i, \sigma_i^2)$, $\forall i$.

5.3.2 Method

Initially $M$ parameter tuples $\{(\mu_i^1, \sigma_i^1), 1 \leq i \leq M\}$ for the normal distribution are picked. The policy is approximated using the Boltzmann distribution. The method comprises two phases. In the first phase, trajectories corresponding to the sampled $\theta$s are simulated and the average cost of each policy is computed. The second phase involves updating the meta-parameters. The algorithm proceeds as follows:

Let iteration index $t$ be set to 1.

First Phase:

1. Sample parameters $\theta^1, \ldots, \theta^N$ are drawn independently from the normal distributions $\{N(\mu_i^t, \sigma_i^t), 1 \leq i \leq M\}$. For $1 \leq j \leq N$, $\theta^j \in \mathbb{R}^M$ and $\theta_i^j$ is sampled from $N(\mu_i^t, \sigma_i^t)$.

2. A trajectory is simulated using the probability distribution $\pi^{\theta^j}(s,a)$, $1 \leq j \leq N$. Hence at every aggregate state $s$ an aggregate action $a$ is picked according to $\pi^{\theta^j}(s, \cdot)$. Once an aggregate action is chosen for a state, the energy distribution is carried out as described in Section 5.2.2.

3. The average cost per step of trajectory $j$ is $\lambda(\theta^j)$ and is computed for the trajectory simulated using $\theta^j$. By abuse of notation we denote $\lambda(\theta^j)$ as $\lambda_j$.

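Second Phase:

4. A quantile value $\rho \in (0,1)$ is selected.

5. The average cost values are sorted in descending order. Let $\lambda_1, \ldots, \lambda_N$ be the sorted order. Hence $\lambda_1 \geq \ldots \geq \lambda_N$.

6. The $\lceil (1-\rho)N \rceil$-th average cost is picked as the threshold level. So, let $\lambda_c = \lambda_{\lceil (1-\rho)N \rceil}$.

7. The meta-parameters $\{(\mu_i^t, \sigma_i^t), 1 \leq i \leq M\}$ are updated (refer [21]) in this phase. In iteration $t$, the parameters are updated in the following manner:

$$\mu_i^{(t+1)} = \frac{\sum_{j=1}^{N} I_{\{\lambda_j \leq \lambda_c\}}\, \theta_i^j}{\sum_{j=1}^{N} I_{\{\lambda_j \leq \lambda_c\}}}, \qquad \sigma_i^{2(t+1)} = \frac{\sum_{j=1}^{N} I_{\{\lambda_j \leq \lambda_c\}} \left( \theta_i^j - \mu_i^{(t+1)} \right)^2}{\sum_{j=1}^{N} I_{\{\lambda_j \leq \lambda_c\}}}. \tag{32}$$

8. Set $t = t + 1$.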
Steps 1-8 are repeated until the variances of the distributions converge to zero. Let $\mu = (\mu_1, \ldots, \mu_M)^\top$ be the vector of means of the converged distributions. The near-optimal SRP found by the algorithm is $\pi^{\mu}$, where

$$\pi^{\mu}(s,a) = \frac{e^{\mu^\top \phi_{sa}}}{\sum_{b \in A'(s)} e^{\mu^\top \phi_{sb}}}, \quad \forall s \in S', \ a \in A'(s).$$
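The two phases can be summarized in a short, generic Python sketch of the cross-entropy search. It is not the implementation used for the experiments: `avg_cost` stands in for the simulated average cost of the policy defined by a sampled parameter vector, and the initial meta-parameters, sample size, quantile and iteration count are arbitrary illustrative values.

```python
import math
import random

def cross_entropy_search(avg_cost, dim, n_samples=100, rho=0.2,
                         n_iters=30, init_sigma=5.0, seed=0):
    """Tune Gaussian meta-parameters (mu, sigma) to minimize avg_cost(theta)."""
    rng = random.Random(seed)
    mu = [0.0] * dim
    sigma = [init_sigma] * dim
    for _ in range(n_iters):
        # First phase: sample parameter vectors and evaluate their cost.
        thetas = [[rng.gauss(mu[i], sigma[i]) for i in range(dim)]
                  for _ in range(n_samples)]
        costs = [avg_cost(th) for th in thetas]
        # Second phase: threshold at the ceil((1 - rho) N)-th cost when
        # the costs are sorted in descending order.
        ranked = sorted(costs, reverse=True)
        threshold = ranked[math.ceil((1 - rho) * n_samples) - 1]
        elite = [th for th, c in zip(thetas, costs) if c <= threshold]
        # Mean and variance of the samples at or below the threshold.
        for i in range(dim):
            mu[i] = sum(th[i] for th in elite) / len(elite)
            var = sum((th[i] - mu[i]) ** 2 for th in elite) / len(elite)
            sigma[i] = math.sqrt(var)
    return mu, sigma
```

In the energy sharing setting, `avg_cost(theta)` would simulate a trajectory under the Boltzmann policy with parameter `theta` and return its average cost per step; a simple quadratic function can be used in its place to check that the means move towards the minimizer while the variances shrink.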

Remark 12. The computational complexity of the cross entropy method is dependent on the 
number of updations required to arrive at the optimal parameter vector and the dimension of 
the vector. For instance in the case of four nodes, with data and energy buffer sizes being 30, 
the cross entropy method requires 10^ sample trajectories for a hyperparameter (p, a) vector 
of dimension 50. The parameter 6 is updated over 10^ iterations to arrive at the optimal 
parameter vector. 

Remark 13. The heuristic cross-entropy algorithm solves hard optimization problems. It 
is an iterative scheme and requires multiple samples to arrive at the solution. In general 
one assumes that the parameter 6 is unknown (non-random variable) and uses actor-critic 
architecture to obtain locally optimal policy. However, obtaining gradient estimates in actor- 
critic architecture is hard as it leads to large variance fWf- On the other hand, in our 
work, we let the parameter $\theta$ be a random variable and assume a probability distribution over $\theta$ with hyperparameter $(\mu, \sigma)$ and use the cross-entropy method to tune the hyperparameters. The cross-entropy method is simple to implement, parallelizable and does not require gradient
estimates. To the best of our knowledge, we are the first to combine the cross-entropy with 
state aggregation and apply it to a real world problem. In the authors sampled from 
the entire transition probability matrix to calculate the score function and tested on problems 
with only small state-action space. 

6 Simulation Results 

In this section we show simulation results for the energy sharing algorithms we described in Sections 4 and 5. For the sake of comparison we implement the greedy heuristic method in the case when the function $g$ has a non-linear form. Also, we implement Q-learning to learn optimal policies for the case where we consider the sum of the data at all nodes and the available energy as the state. These methods are as follows:

1. Greedy: This method takes as input the level of data $q_k^i$ at all nodes and supplies the energy based on the requirement. Since $g(x)$ is the number of data bits that can be sent given $x$ bits of allocated energy, $g^{-1}(y)$ gives the amount of energy required to send $y$ bits of data. Suppose the energy available in the source is $E_k$ at stage $k$. The greedy algorithm then provides $t_k$ units of energy, where $t_k = \min\left( E_k, \sum_{i=1}^{n} g^{-1}(q_k^i) \right)$. The energy bits are then shared between the nodes based on the proportion of the requirement of the nodes.

2. Combined Nodes Q-learning: The state considered here is the sum of the data at all nodes and the available energy. Let the state space be $S_c$ and action space be $A_c$. So state $s_k = \left( \sum_{i=1}^{n} q_k^i, E_k \right)$. The control specified is $t_k$, which is the total energy that needs to be distributed between the nodes. In contrast to the action space considered earlier, here the exact split is not decided upon. Instead, this method finds the total optimal energy to be supplied. The algorithm in Section 4.2 is then used to learn the optimal policies for the state-action space described here.

In the above described methods, after an action $t_k$ is selected, the proportion of data in the nodes is computed. Thus $p_k^i = q_k^i / \sum_{j=1}^{n} q_k^j$, $1 \le i \le n$, is computed at time $k$, where $0 \le p_k^i \le 1$ and $\sum_{i=1}^{n} p_k^i = 1$. Each of the $t_k$ bits of energy is then shared based on these probabilities. Let $u_i$ be the number of bits provided to node $i$. Then, in both the greedy method and the combined nodes Q-learning method, $\sum_{i=1}^{n} u_i = t_k$, where $t_k$ is the total energy selected by the respective method.
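The greedy rule and the proportional sharing step can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, $g(x) = \ln(1 + x)$ is used as the example conversion function, the split follows the data proportions $p_k^i$, and the total energy is truncated to whole units before being shared, which the text does not specify.

```python
import math
import random

def g(x):
    """Example conversion function: bits that can be sent with x units of energy."""
    return math.log(1.0 + x)

def g_inv(y):
    """Energy required to send y bits (inverse of the example g above)."""
    return math.exp(y) - 1.0

def greedy_total_energy(queues, energy_available):
    """Greedy total: t_k = min(E_k, sum_i g^{-1}(q_k^i))."""
    required = sum(g_inv(q) for q in queues)
    return min(energy_available, required)

def proportional_split(queues, total_energy, rng):
    """Give each whole unit of energy to node i with probability p_k^i."""
    n = len(queues)
    units = [0] * n
    total_data = sum(queues)
    if total_data == 0:
        return units  # no data anywhere, so nothing is shared
    probs = [q / total_data for q in queues]
    for _ in range(int(total_energy)):  # assumption: truncate to whole units
        i = rng.choices(range(n), weights=probs)[0]
        units[i] += 1
    return units
```

For example, `proportional_split([2, 1], greedy_total_energy([2, 1], 20), random.Random(0))` shares the greedy total between two nodes holding 2 and 1 units of data.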


6.1 Experimental Setup 

• The algorithms described in Section are simulated with two nodes and an energy 
source. We consider the following settings: 

1. For the case of jointly Markov data arrival and Markovian energy arrival processes, we consider an energy buffer size of 20 and a data buffer size of 10. The data arrivals evolve as $X_k = A X_{k-1} + \omega$, where $A$ is a fixed $2 \times 2$ matrix of coefficients and $\omega = (\omega_1, \omega_2)^T$ is a $2 \times 1$ random noise (or disturbance) vector. Here A = ([jj g;! ). The energy arrival evolves as $Y_k = b Y_{k-1} + \chi$, where $\chi$ is also a random noise (or disturbance) variable and $b = 0.5$ is a fixed coefficient. The components of the vector $\omega$ and $\chi$ are Poisson distributed. In the simulations, we vary the mean of the random noise variable $\omega_1$, while the means of $\omega_2$ and $\chi$ are kept constant.

2. For the case of i.i.d. data and energy arrivals, the data and energy buffer sizes are fixed at 14. $X^1$, $X^2$ and $Y$ are distributed according to the Poisson distribution. In the simulations, the mean data arrival at node two is fixed while that at node one is varied.

• The algorithms described in Section are simulated with four nodes and an energy 
source. We consider the following settings: 

1. For the case of jointly Markov data arrival and Markovian energy arrival processes, we consider an energy buffer size of 25 and a data buffer size of 10. The data arrivals evolve as $X_k = A X_{k-1} + \omega$, where $A$ is a fixed $4 \times 4$ matrix of coefficients and $\omega = (\omega_1, \omega_2, \omega_3, \omega_4)^T$ is a $4 \times 1$ random noise (or disturbance) vector. Here

/0.1 0.1 0.1 0.2 \
A = ( 0 1 0 2 0 1 0 1 )
Vo :2 oh oil oil /

The energy arrival evolves as $Y_k = b Y_{k-1} + \chi$, where $\chi$ is also a random noise (or disturbance) variable and $b = 0.5$ is a fixed coefficient. The components of the vector $\omega$ and $\chi$ are Poisson distributed. In the simulations, we vary the mean of the random noise variable $\omega_1$, while the means of $\omega_2$-$\omega_4$ and $\chi$ are kept constant. The energy buffer had 4 partitions while the data buffer had 2 partitions.

2. For the case of i.i.d. data and energy arrivals, the buffer sizes are taken to be 30 each. The data and energy buffers are clustered into 6 partitions. $X^1$, $X^2$, $X^3$, $X^4$ and $Y$ are Poisson distributed. In these experiments, the mean data arrivals at nodes 2, 3 and 4 are fixed while that at node 1 is varied.
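The Markovian arrival processes used in the settings above can be sketched as follows. This is a minimal sketch, not the simulation code used for the experiments: the function names are hypothetical, the Poisson sampler is Knuth's standard method, any coefficient matrix passed as `A` is a placeholder supplied by the caller, and any rounding of the real-valued arrivals to whole bits or units is left out because the text does not describe it.

```python
import math
import random

def poisson(mean, rng):
    """Sample a Poisson variate with Knuth's method (adequate for small means)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def markov_data_arrivals(x_prev, A, noise_means, rng):
    """One step of X_k = A X_{k-1} + omega with Poisson noise components."""
    n = len(x_prev)
    return [
        sum(A[i][j] * x_prev[j] for j in range(n)) + poisson(noise_means[i], rng)
        for i in range(n)
    ]

def markov_energy_arrival(y_prev, b, noise_mean, rng):
    """One step of Y_k = b Y_{k-1} + chi with Poisson noise chi."""
    return b * y_prev + poisson(noise_mean, rng)
```

The i.i.d. settings correspond to calling `poisson` directly with the chosen means at every step.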

For all Q-learning algorithms ($\epsilon$-greedy, UCB based and Combined Nodes), a stepsize of $\alpha = 0.1$ is used in the update rule. For the $\epsilon$-greedy method, $\epsilon = 0.1$ is used for exploration. In the UCB exploration mechanism, the value of $\beta$ is set to 1. In our experimental simulations, we consider the function $g(x) = \ln(1 + x)$ for the i.i.d. case and $g(x) = 2\ln(1 + x)$ for the non-i.i.d. case.
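To make the role of these parameters concrete, the following sketch shows one common way of writing the two exploration rules and a tabular average-cost update. It is an assumption-laden illustration, not the paper's update rule: the relative value form of the update, the reference state `ref_state`, the exact shape of the UCB bonus and all function names are hypothetical choices; only the values of the stepsize, $\epsilon$ and $\beta$ are taken from the text.

```python
import math
import random

ALPHA, EPSILON, BETA = 0.1, 0.1, 1.0  # values reported in the text

def epsilon_greedy(q_row, rng, epsilon=EPSILON):
    """With probability epsilon pick a random action, else the lowest-cost one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return min(range(len(q_row)), key=lambda a: q_row[a])

def ucb_action(q_row, counts, beta=BETA):
    """Pick the action minimizing Q minus an exploration bonus (assumed form)."""
    for a, c in enumerate(counts):
        if c == 0:
            return a  # try every action once before using the bonus
    total = sum(counts)
    return min(
        range(len(q_row)),
        key=lambda a: q_row[a] - beta * math.sqrt(math.log(total) / counts[a]),
    )

def rvi_q_update(Q, s, a, cost, s_next, ref_state, alpha=ALPHA):
    """Relative value Q-learning step for the average-cost criterion (assumed form)."""
    target = cost + min(Q[s_next]) - min(Q[ref_state])
    Q[s][a] += alpha * (target - Q[s][a])
```

Since costs are minimized, the bonus is subtracted so that rarely tried actions look cheaper and are explored more often.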


6.2 Results 


Figs. 3, 4a, 4b and 9a show the performance of our algorithms in simulations carried out with two nodes and a single source. Similarly, the plots for the four-node case show the performance comparisons of our algorithms with other algorithms; the simulations in this case are carried out with four nodes and a single source. In the plots for the Markov case, jointly Markov data arrival and Markovian energy arrival processes are considered, and the mean of the noise in the data arrival at Node 1, i.e., $E[\omega_1]$, is varied while that at the other nodes is kept constant. In the plots for the i.i.d. case of data and energy arrivals, the mean data arrival at Node 1 ($E[X^1]$) is varied while keeping that at the other node(s) constant. The mean energy arrival is also fixed. Figs. 3-9b show the normalized long-run average cost of the policies determined by the algorithms along the y-axis.

The Q-learning algorithm is designed to learn optimal policies, hence it outperforms the other algorithms, as shown in Figs. 3, 4a, 4b and 9a. The policy learnt by our algorithm does better than the greedy policy and the policy obtained from the combined nodes Q-learning method. Note that Q-learning on combined nodes learns the total energy to be distributed and not the exact split. Hence its performance is poor compared to Q-learning on our problem MDP. Thus, sharing energy by considering only the total amount of data in all the nodes is not optimal.

The long-run normalized average costs of the policies obtained from the Greedy method and from our approximation algorithms are also shown in the plots. Since our algorithms are model-free, they learn optimal or near-optimal policies irrespective of the distributions of the energy and data arrivals (see Figs. 7a and 7b). These plots show that our approximation algorithms outperform the greedy and combined nodes Q-learning methods.
Figure 3: $E_{max} = 20$, $D_{max} = 10$; $\omega_1$, $\omega_2$, $\chi$ are Poisson distributed with $E[\omega_2] = 1.0$, $E[\chi] = 20$.

Figure 4: Performance comparison of policies when $E_{max} = D_{max} = 14$. (a) $X^1$, $X^2$, $Y$ are Poisson distributed with $E[Y] = 13$, $E[X^2] = 1.0$. (b) $X^1$: Poisson distributed, $X^2$: hyperexponentially distributed and $Y$: exponentially distributed, with $E[X^2] = 0.625$, $E[Y] = 10$.

Figure 5: Performance comparison of policies with $E_{max} = D_{max} = 30$ when the data arrivals and $Y$ are Poisson distributed with $E[Y] = 25$ and the fixed mean data arrival equal to 1.0.


It can be observed that the gap between the average costs obtained from the combined nodes Q-learning method and the approximate learning algorithm (see Section 5.2) increases with an increase in the number of nodes. This is clear from Figs. 4a and 7a. This occurs because the combined nodes Q-learning method wastes energy and the amount of wastage increases with an increase in the number of nodes.

Fig. shows the variation in average cost with different numbers of partitions of the data and energy buffers used in state aggregation. As the number of partitions increases, the number of clusters also increases, resulting in better policies.
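The relation between partitions and clusters can be illustrated with a small sketch. It assumes equal-width partitions of each buffer and an aggregated state made of one partition index per data buffer plus one for the energy buffer (the i.i.d. case); both the partitioning rule and the function names are hypothetical, since the text only gives the number of partitions.

```python
def partition_index(level, buffer_size, num_partitions):
    """Map a buffer level in [0, buffer_size] to one of num_partitions equal-width bins."""
    if not 0 <= level <= buffer_size:
        raise ValueError("level outside buffer range")
    width = (buffer_size + 1) / num_partitions
    return min(int(level / width), num_partitions - 1)

def aggregate_state(queues, energy, d_max, e_max, d_parts, e_parts):
    """Aggregated state: one bin index per data buffer, then the energy bin index."""
    data_bins = tuple(partition_index(q, d_max, d_parts) for q in queues)
    return data_bins + (partition_index(energy, e_max, e_parts),)

def num_clusters(n_nodes, d_parts, e_parts):
    """Number of aggregated states for n_nodes data buffers and one energy buffer."""
    return d_parts ** n_nodes * e_parts
```

Under these assumptions, four nodes with 6 partitions per buffer give `num_clusters(4, 6, 6) == 7776` aggregated states, against $31^5$ states when buffers of size 30 are not aggregated, which shows why more partitions yield finer (and costlier) policies.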

The single-stage cost function, as defined in (10), includes the effect of the action through the conversion function $g(\cdot)$. The effect of the action taken can be explicitly included in a single-stage cost function of the following form:

$$c(s_k, T(s_k)) = \sum_{i=1}^{n} \left( r_1 \left(q_k^i - g(T^i(s_k))\right)^+ + r_2\, T^i(s_k) \right), \tag{33}$$

where $r_1$, $r_2$ are the tradeoff parameters, $r_1 + r_2 = 1$ and $r_1, r_2 \ge 0$. The above equation is a convex combination of the sum of data queue lengths and the collective energy supplied to the nodes. It can be observed that the single-stage cost function (10) used in our MDP model can be derived from (33) by taking $r_1 = 1$ and $r_2 = 0$. When $r_1 > 0$ and $r_2 > 0$, the cost structure (33) gives importance to the data queue length as well as the amount of energy supplied. The performance comparison of our algorithms with the greedy and combined nodes Q-learning methods using this single-stage cost function is shown in Fig. 9a. For the simulations, buffer sizes are fixed at 14 and $X^1$, $X^2$, $Y$ are distributed according to the Poisson distribution with $E[Y] = 13$ and $E[X^2] = 1.0$.
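The cost (33) can be evaluated as in the sketch below, which reads it as a weighted sum, over the nodes, of the residual queue length $(q_k^i - g(T^i(s_k)))^+$ and the energy $T^i(s_k)$ supplied. The function names are hypothetical and $g(x) = \ln(1 + x)$ is used only as an example conversion function.

```python
import math

def g(x):
    """Example conversion function: bits that can be sent with x units of energy."""
    return math.log(1.0 + x)

def single_stage_cost(queues, allocation, conv, r1, r2):
    """c = sum_i [ r1 * max(q_i - conv(T_i), 0) + r2 * T_i ] with r1 + r2 = 1."""
    if r1 < 0 or r2 < 0 or abs(r1 + r2 - 1.0) > 1e-9:
        raise ValueError("r1, r2 must be non-negative and sum to 1")
    return sum(
        r1 * max(q - conv(t), 0.0) + r2 * t
        for q, t in zip(queues, allocation)
    )
```

Calling `single_stage_cost(queues, allocation, g, 1.0, 0.0)` gives the pure queue-length cost, while `r1 = 0.7`, `r2 = 0.3` corresponds to the tradeoff used for Fig. 9a.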
Figure 6: $E_{max} = 25$, $D_{max} = 10$; $\omega_1$-$\omega_4$ are Poisson distributed with $E[\omega_2] = E[\omega_3] = E[\omega_4] = 1.0$, $E[\chi] = 5$.


In Fig. 9a, the x-axis indicates the change in the data rate of Node 1. This setup is akin to that used in Fig. 4a. The y-axis indicates the normalized average queue length of all the nodes. We considered the values $r_1 = 0.7$ and $r_2 = 0.3$. The plot indicates only the average queue length of all nodes, since our objective is to minimize the average delay in the transmission of data (which is related to the data queue length). From Fig. 9a, it can be observed that all learning algorithms show an increase in the collective average queue length (referred to as the normalized average cost in Fig. 4a). This occurs because, by using the cost function (33), the learning algorithms (Q-learning with UCB and $\epsilon$-greedy exploration as well as combined nodes Q-learning) give less importance to the queue length component in the cost function. Thus the policies learnt by these algorithms reduce the energy usage, albeit with an increase in data queue length. As the figure shows, the learning algorithms we described perform much better than the greedy and combined nodes methods.


In Fig. 9b, the performance of Q-learning with and without state aggregation is shown for the case of two nodes and an EH source (i.i.d. case) and compared with the greedy and combined nodes Q-learning methods. The $\epsilon$-greedy exploration mechanism is used for both algorithms. The experimental setup is similar to that used in the earlier two-node i.i.d. experiments. The x-axis indicates the variation in the data rate of Node 1, while the y-axis indicates the normalized average cost of the nodes. The algorithm in Section 5.2 was simulated by partitioning the data and energy buffers into 3 partitions each. It can be observed in Fig. 9b that Q-learning with state aggregation performs better than the greedy and combined nodes methods. However, since Q-learning with state aggregation finds only a near-optimal policy, its performance is not as good as that of the algorithm in Section 4.2 with the same exploration mechanism.


Figure 7: Performance comparison of policies when $E_{max} = D_{max} = 30$. (a) $X^1$, $X^2$, $X^3$, $X^4$, $Y$ are Poisson distributed with $E[X^2] = E[X^3] = E[X^4] = 1.0$, $E[Y] = 25$. (b) Three of the data arrival processes are Poisson distributed, with two of the means, including $E[X^3]$, equal to 0.7; the remaining data arrival process is hyperexponentially distributed and $Y$ has the exponential distribution with $E[Y] = 20$.

Remark 14. The Greedy algorithm distributes the available energy among the sensor nodes based on the proportion of data available in the nodes. It shares all the available energy at every decision instant without storing it for future use. We compare our algorithms with the Greedy algorithm in order to show that a myopic strategy may not be optimal. Our results show that one has to devise the policy not only for the present requirement for energy but also for the future energy requirements. This idea is naturally incorporated in our RL algorithms. Moreover, the Greedy policy is optimal when the conversion function $g$ is linear. This has been derived in earlier work for the case of a single sensor, where the performance of the proposed algorithm with non-linear $g$ is also compared with the performance of the greedy method. Thus, the comparison of the performance of our algorithms with the greedy method also follows naturally from the earlier cited works.

The Combined Nodes Q-learning method learns the policy which maps the total number of data bits available in all the nodes to the total amount of energy required. The energy sharing between the nodes is then based on the proportion of data available in the nodes. Under the Combined Nodes Q-learning algorithm, the state space is greatly reduced, i.e., instead of the Cartesian product of the states of each node (as in our Q-learning method with and without state aggregation), it is just the sum of the states of the sensor nodes. So, learning is faster in the Combined Nodes Q-learning algorithm. However, the policy learnt is suboptimal, as was shown in the plots, and performs poorly in comparison with our algorithms. So, we compare our algorithms with Combined Nodes Q-learning to illustrate the tradeoff between the size of the state space and the nature of the obtained policy.

Note that the energy sharing policy learnt by our RL algorithms is not quantized to a single point but considers energy sharing among the sensor nodes. Learning an optimal energy sharing scheme is a difficult problem. Hence, we would like to understand how well our algorithms perform against a simple heuristic policy such as Greedy or a policy obtained from the Combined Nodes Q-learning method.
Figure 8: Performance of QL $\epsilon$-greedy with different numbers of data and energy buffer partitions, when the data arrivals and $Y$ are Poisson distributed with $E[X^2] = E[X^3] = E[X^4] = 1.0$, $E[Y] = 25$ and $D_{max} = E_{max} = 30$.


Remark 15. The function $g(\cdot)$ gives the number of bits that can be transmitted using a certain number of units of energy. Our algorithms work regardless of the form of $g$. RL algorithms use simulation samples to learn the energy sharing policy by trying out various actions in each of the states. In our problem, let us assume that at time $k$ we are in state $s_k = (q_k^1, q_k^2, \ldots, q_k^n, E_k, X_{k-1}, Y_{k-1})$, i.e., the data in the data buffers and the energy in the energy buffer are fixed to some values. Based on the current Q-values, we share the available energy among the various sensor nodes by selecting an action $T_k = (T_k^1, T_k^2, \ldots, T_k^n)$. Depending on the action $T_k$, the state of the system evolves according to the state evolution equations.

In order to find the next state of the system, it suffices to know the number of bits that got transmitted by choosing the action $T_k$ in slot $k$ in a real system, which is given by $g(T_k)$. It must be noted that we do not need information on the functional form of $g$ for finding the next state, but only the value of the function for the action $T_k$. This value can be observed (in a real system) even if we do not have the precise model for the Gaussian function in terms of $g(\cdot)$. In other words, all we need is to observe the number of bits that got transmitted by supplying $T_k$ units of energy.

To update the Q-value of the state-action pair $(s_k, T_k)$ (see (21)), we need to know the cost $c(s_k, T_k)$ incurred by choosing action $T_k$ in state $s_k$, which is computed using (10), where again we only require information on $g(T_k^i)$, $i = 1, 2, \ldots, n$, but not the exact form of $g(\cdot)$. Our proposed RL algorithms work by updating Q-values, and such an update essentially requires the cost information (computed using (10)). Similarly, in the cross entropy method, to compute the average cost of the policy, we need to compute the single-stage cost (using (11)). In summary, our algorithms do not require the exact form of $g(\cdot)$.
Figure 9: Performance comparison of policies: $E_{max} = D_{max} = 14$ when $X^1$, $X^2$, $Y$ are Poisson distributed with $E[Y] = 13$, $E[X^2] = 1.0$.


In the case of the greedy algorithm, in order to decide the number of energy units $T_k$ that need to be shared, the function $g^{-1}(\cdot)$, and hence the functional form of $g(\cdot)$, must be known (see the description of the greedy method), i.e., one needs to obtain the mathematical model for the conversion function. In comparison, as stated before, our algorithms do not need such information.

However, to simulate the environment, we need to know the functional form of the conversion function $g$. In a real physical system, in contrast, our algorithms do not require the functional form of $g$. The performance of our algorithms and of the Greedy and Combined Nodes Q-learning methods for a different form of the function $g(\cdot)$, i.e., g(·) = -^/S log(1 + x), is illustrated in Fig. 10. The setup is otherwise similar to that of the earlier two-node Markov experiment. We observe from Fig. 10 that, irrespective of the form of $g(\cdot)$, our algorithms find good policies, since they do not require this knowledge to do so.
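The point of this remark, that a learning agent only needs the observed number of transmitted bits and not the form of $g$, can be sketched as a single environment step. The dynamics used here (residual queue plus new arrivals capped at $D_{max}$, remaining energy plus harvested energy capped at $E_{max}$, cost equal to the sum of residual queue lengths) are assumptions made for illustration, and all names are hypothetical.

```python
def step(queues, energy, allocation, bits_sent, data_arrivals, energy_arrival,
         d_max, e_max):
    """Advance one slot using only the observed bits_sent, never g itself."""
    if sum(allocation) > energy:
        raise ValueError("cannot allocate more energy than is stored")
    # Residual data after transmission; bits_sent[i] is what node i was observed to send.
    residual = [max(q - b, 0) for q, b in zip(queues, bits_sent)]
    cost = sum(residual)  # assumed single-stage cost: total residual queue length
    next_queues = [min(r + x, d_max) for r, x in zip(residual, data_arrivals)]
    next_energy = min(energy - sum(allocation) + energy_arrival, e_max)
    return next_queues, next_energy, cost
```

In a simulation, `bits_sent` would be produced by evaluating a chosen $g$ on the allocation; in a physical system it would simply be measured, which is why the learning update itself does not depend on the form of $g$.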


7 Conclusions and Future Work 

We studied the problem of energy sharing in sensor networks and proposed a new technique to manage energy available through harvesting. Multiple nodes in the network sense random amounts of data and share the energy harvested by an energy harvesting source. We presented an MDP model for this problem and an algorithm that determines the optimal amount of energy to be supplied to every node at a decision instant. The algorithm minimizes the sum of (data) queue lengths in the data buffers by finding the optimal energy split profile. In order to deal with the curse of dimensionality, we also proposed approximation algorithms that employ state aggregation effectively to reduce the computational complexity. Numerical experiments showed that our algorithms outperform the greedy and combined nodes Q-learning methods.
Figure 10: $E_{max} = 20$, $D_{max} = 10$; $\omega_1$, $\omega_2$, $\chi$ are Poisson distributed with $E[\omega_2] = 1.0$, $E[\chi] = 20$, $g(x) = \log(1 + x)$.


Our future work would involve applying threshold tuning for state aggregation, gradient based approaches and basis adaptation methods for policy approximation. The partitions formed for clustering the state space (Section 5.1) can be improved by tuning the partition thresholds (see [31]). This method can be employed to obtain improved deterministic policies when the state-action space is extremely large. Gradient based methods approximate the policy using a parameter $\theta$ and a set of given (fixed) basis functions $\{f_k : 1 \le k \le n\}$. Typically a probability distribution over the actions corresponding to a state is defined using $\theta$ and $\{f_k\}$. The parameter is updated using the gradient direction of the policy performance, which is usually the long-run average or discounted cost of the policy. In the approximation algorithm described in Section 5, the basis functions used in the policy parameterization are fixed. One could obtain better policies if the basis functions are also optimized. Basis adaptation methods [23] start with a given set of basis functions. The random policy parameter $\theta$ is updated using simulated trajectories of the MDP on a faster timescale. The basis functions are tuned on a slower timescale. These methods can be employed to find better policies. We shall also develop prototype implementations for this model and test our algorithms.